Synchronize push_back and std::thread
My code
void build(std::vector<RKD<DivisionSpace>>& roots, ...) {
    try {
        // a local lock_guard locking mtx guarantees unlocking on destruction / exception:
        std::lock_guard<std::mutex> lck(mtx);
        roots.push_back(RKD<DivisionSpace>(...));
    } catch (const std::bad_alloc&) {
        std::cout << "[exception caught when constructing tree]\n";
        return;
    }
}
The actual work is now done serially, not in parallel. An RKD constructor can run in parallel with other RKD constructors; however, pushing the objects back into the std::vector is a critical section, right?
The number of objects I am going to build is known. In practice it will be in the range [2, 16]; in theory it could be any positive number. Also, I don't care about the order in which they are inserted into the container.
So I could do something like:
RKD tree = RKD(...);
mutex_lock(...);
roots.push_back(tree);
However, that would mean copying, right?
What should I do to make my code parallel?
I decided to use a lock (not just a mutex) because of this answer.
push_back on a std::vector may need to reallocate its storage and copy (or, preferably, move) the existing elements. That reallocation is a critical section that needs to be synchronized.
In the following examples, suppose we want a vector filled with the first 12 primes, but we don't care about their order. (I've hard-coded the numbers here, but imagine they come from an expensive computation that is worth doing in parallel.) The following scenario contains a dangerous race condition.
std::vector<int> numbers {}; // an empty vector
// thread A // thread B // thread C
numbers.push_back( 2);    numbers.push_back(11);    numbers.push_back(23);
numbers.push_back( 3);    numbers.push_back(13);    numbers.push_back(29);
numbers.push_back( 5);    numbers.push_back(17);    numbers.push_back(31);
numbers.push_back( 7);    numbers.push_back(19);    numbers.push_back(37);
There is a second problem with push_back: if two threads call it at the same time, both may try to construct an object at the same index, with potentially disastrous consequences. So the problem is not solved by calling reserve(n) before spawning the threads.
However, since you know the number of elements in advance, you can simply assign each one to a specific location inside the std::vector without changing its size. If you don't change the size, there is no critical section, and therefore no race in the following scenario.
std::vector<int> numbers(12); // 12 elements initialized with 0
// thread A // thread B // thread C
numbers[ 0] = 2; numbers[ 1] = 3; numbers[ 2] = 5;
numbers[ 3] = 7; numbers[ 4] = 11; numbers[ 5] = 13;
numbers[ 6] = 17; numbers[ 7] = 19; numbers[ 8] = 23;
numbers[ 9] = 29; numbers[10] = 31; numbers[11] = 37;
Of course, if two threads try to write to the same index, the race reappears. Fortunately, guarding against this is not difficult in practice: if your vector has n elements and you have p threads, thread i writes only to the elements in [i n / p, (i + 1) n / p). Note that this is preferable over having thread i write to element j only if j mod p = i, because the contiguous ranges lead to fewer cache invalidations. In that respect the access pattern in the example above is suboptimal; it is better done like this.
std::vector<int> numbers(12); // 12 elements initialized with 0
// thread A // thread B // thread C
numbers[ 0] = 2; numbers[ 4] = 11; numbers[ 8] = 23;
numbers[ 1] = 3; numbers[ 5] = 13; numbers[ 9] = 29;
numbers[ 2] = 5; numbers[ 6] = 17; numbers[10] = 31;
numbers[ 3] = 7; numbers[ 7] = 19; numbers[11] = 37;
So far so good. But what if you don't have a std::vector<int> but a std::vector<Foo>? If Foo has no default constructor, then
std::vector<Foo> numbers(10);
will not compile. And even if it has one, it would be wasteful to create many expensive default-constructed objects just to reassign them soon after without ever using their values.
Of course, most well-designed classes should have a very cheap default constructor. For example, std::string default-constructs to an empty string that requires no memory allocation. A good implementation will reduce the cost of default-constructing a string to essentially
std::memset(this, 0, sizeof(std::string));
And if the compiler is smart enough to see that we are allocating and initializing an entire std::vector<std::string>(n), it can optimize this further down to a single call to
std::calloc(n, sizeof(std::string));
So if there is any chance you can make Foo cheap to default-construct and assign, you're done. However, if that proves difficult, you can avoid the problem by moving the objects to the heap. A smart pointer is cheap to default-construct, so
std::vector<std::unique_ptr<Foo>> foos(n);
ultimately boils down to
std::calloc(n, sizeof(std::unique_ptr<Foo>));
without constructing any Foo at all. Of course, this convenience comes at the price of a dynamic memory allocation for each element.
std::vector<std::unique_ptr<Foo>> foos(n);
// thread A // thread B // thread C
foos[0].reset(new Foo {...}); foos[n / 3 + 0].reset(new Foo {...}); foos[2 * n / 3 + 0].reset(new Foo {...});
foos[1].reset(new Foo {...}); foos[n / 3 + 1].reset(new Foo {...}); foos[2 * n / 3 + 1].reset(new Foo {...});
foos[2].reset(new Foo {...}); foos[n / 3 + 2].reset(new Foo {...}); foos[2 * n / 3 + 2].reset(new Foo {...});
... ... ...
This may not be as bad as you think, because while dynamic memory allocations are not free, the sizeof a std::unique_ptr is very small, so if sizeof(Foo) is large, you get the bonus of a more compact vector that is faster to iterate over. It all depends on how you intend to use your data.
If you don't know the exact number of elements in advance, or are afraid you will mess up the indexing, there is another way to do it: each thread fills its own vector, and the vectors are concatenated at the end. Continuing with the primes example, we get this.
std::vector<int> numbersA {}; // private store for thread A
std::vector<int> numbersB {}; // private store for thread B
std::vector<int> numbersC {}; // private store for thread C
// thread A // thread B // thread C
numbersA.push_back( 2);    numbersB.push_back(11);    numbersC.push_back(23);
numbersA.push_back( 3);    numbersB.push_back(13);    numbersC.push_back(29);
numbersA.push_back( 5);    numbersB.push_back(17);    numbersC.push_back(31);
numbersA.push_back( 7);    numbersB.push_back(19);    numbersC.push_back(37);
// Back on the main thread after A, B and C are joined:
std::vector<int> numbers(
numbersA.size() + numbersB.size() + numbersC.size());
auto pos = numbers.begin();
pos = std::move(numbersA.begin(), numbersA.end(), pos);
pos = std::move(numbersB.begin(), numbersB.end(), pos);
pos = std::move(numbersC.begin(), numbersC.end(), pos);
assert(pos == numbers.end());
// Now dispose of numbersA, numbersB and numbersC as soon as possible
// in order to release their no longer needed memory.
(The std::move used in the code above is the three-iterator overload from the algorithms library, <algorithm>, not the one from <utility>.)
This approach has the most desirable memory access pattern of all, since numbersA, numbersB and numbersC write to completely independent, separately allocated memory. Of course, we have to do additional sequential work to join the intermediate results. Note that the efficiency relies on the cost of moving an element being negligible compared to the cost of finding / creating it. As written, the code also assumes that your type is cheap to default-construct; if that is not the case for your type, you can again use smart pointers.
Hopefully this has provided you with enough ideas to optimize your problem.
If you have never used smart pointers before, take a look at "RAII and Smart Pointers in C++" and the standard library's dynamic memory management utilities. The techniques shown above would, of course, also work with a std::vector<Foo *>, but resource-owning raw pointers like that are no longer used in modern C++.
The problem is that your constructor is doing a lot of work, and this breaks all sorts of library conventions about container construction and insertion.
Simply fix that by decoupling insertion from creation.
The code below is very similar to the code suggested by @5gon12eder, except that it does not "force" you to move the objects into place.
In my little demo:
- We use a raw memory region that is completely uninitialized (this is not possible with a vector, where insertion implies initialization). So instead of the "canonical"
std::array<RKD, 500> rkd_buffer; // OR std::vector<RKD> rkd_buffer(500); // OR even std::unique_ptr<RKD[]> rkd_buffer(new RKD[500]);
we use the custom combination:
std::unique_ptr<RKD[N], decltype(&::free)> rkd_buffer( static_cast<RKD(*)[N]>(::malloc(sizeof(RKD) * N)), ::free );
- We create multiple threads (5 in the sample) to construct all the elements. The elements are constructed in place, and their destructors are invoked explicitly at the end; it is therefore important that all elements are fully initialized before rkd_buffer goes out of scope (the join provides this here).
- The threads can be synchronized in various ways: construction jobs could, for example, be sent through a work queue to a thread pool, and condition variables, promises, thread barriers (from Boost), or even plain atomic counters could be used for coordination.
All of these choices are essentially incidental to the task of constructing in parallel, so I'll leave them to your imagination (or other SO answers).
#include <chrono>
#include <cstdlib>
#include <memory>
#include <thread>
#include <vector>
using namespace std;

struct RKD {
    RKD() { this_thread::sleep_for(chrono::milliseconds(rand() % 100)); } // expensive
};

int main() {
    static const int N         = 500;
    static const int ChunkSize = 100;
    std::unique_ptr<RKD[N], decltype(&::free)> rkd_buffer(
        static_cast<RKD(*)[N]>(::malloc(sizeof(RKD) * N)), ::free);

    vector<thread> group;
    for (int chunk = 0; chunk < N / ChunkSize; ++chunk)
        group.emplace_back([&, chunk] { // capture chunk by value, not by reference
            for (int i = chunk * ChunkSize; i < (chunk + 1) * ChunkSize; ++i)
                new (*rkd_buffer + i) RKD; // placement-new into the raw buffer
        });
    for (auto& t : group) if (t.joinable()) t.join();

    // we are responsible for destruction, since we also took responsibility for construction
    for (RKD& v : *rkd_buffer)
        v.~RKD();
}
You can see that 5 threads share the 500 constructions. Each construction takes ~50 ms on average, so with 100 constructions per thread the total time should be about 100 × 50 ms ≈ 5 s. And indeed, that is exactly what happens:
real 0m5.193s
user 0m0.004s
sys 0m0.000s