Synchronize push_back and std::thread

My code

void build(std::vector<RKD<DivisionSpace>>& roots, ...) {
  try {
    // using a local lock_guard to lock mtx guarantees unlocking on destruction / exception:
    std::lock_guard<std::mutex> lck (mtx);
    roots.push_back(RKD<DivisionSpace>(...));
  }
  catch (const std::bad_alloc&) {
    std::cout << "[exception caught when constructing tree]\n";
    return;
  }
}

However, this way the actual work is done serially and not in parallel.

An RKD constructor can run in parallel with other RKD constructors. However, pushing the objects back into the std::vector is a critical section, right?

The number of objects that I am going to build is known. It will be something in the range [2, 16] in practice. In theory, this could be any positive number.

Also, I don't care about the order in which they are inserted into the container.

So I could do something like:

RKD tree = RKD(...);
mutex_lock(...);
roots.push_back(tree);

However, that would mean copying, right?
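
Perhaps something like this, with std::move so that only a cheap move (not a copy) happens under the lock, assuming RKD is movable:

RKD<DivisionSpace> tree(...);              // expensive construction, outside any lock
{
    std::lock_guard<std::mutex> lck(mtx);  // serialize only the insertion
    roots.push_back(std::move(tree));      // moves instead of copies, if RKD is movable
}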

What should I do to make my code parallel?


I decided to use a lock_guard (not just a bare mutex) because of this answer.



2 answers


The proposal that Tomasz Lewowski brought up in his comments, and which I expand on here, is quite simple and based on the following observation: a push_back on a std::vector potentially needs to reallocate the backing store and copy (or, preferably, move) the elements. This constitutes a critical section that needs to be synchronized.

In the following examples, suppose we want to fill a vector with the first 12 prime numbers, but we don't care about their order. (I have simply hard-coded the numbers here, but assume they come from an expensive computation that makes sense to do in parallel.) The following scenario contains a dangerous race condition.

std::vector<int> numbers {};  // an empty vector

// thread A             // thread B             // thread C

numbers.push_back( 2);  numbers.push_back(11);  numbers.push_back(23);
numbers.push_back( 3);  numbers.push_back(13);  numbers.push_back(29);
numbers.push_back( 5);  numbers.push_back(17);  numbers.push_back(31);
numbers.push_back( 7);  numbers.push_back(19);  numbers.push_back(37);

There is another problem with push_back: if two threads call it concurrently, they will both try to construct an object at the same index, with potentially disastrous consequences. So the problem is not solved by calling reserve(n) before launching the threads.

However, since you know the number of elements in advance, you can simply assign them to specific locations inside the std::vector without changing its size. If you don't change the size, there is no critical section, and therefore no race in the following scenario.

std::vector<int> numbers(12);  // 12 elements initialized with 0

// thread A          // thread B          // thread C

numbers[ 0] =  2;    numbers[ 1] =  3;    numbers[ 2] =  5;
numbers[ 3] =  7;    numbers[ 4] = 11;    numbers[ 5] = 13;
numbers[ 6] = 17;    numbers[ 7] = 19;    numbers[ 8] = 23;
numbers[ 9] = 29;    numbers[10] = 31;    numbers[11] = 37;

Of course, if two threads try to write to the same index, the race reappears. Fortunately, guarding against this is not difficult in practice: if your vector has n elements and you have p threads, thread i writes only to the elements in the half-open range [i * n / p, (i + 1) * n / p). Note that this is preferable to a scheme where thread i writes to element j only if j mod p == i, because the contiguous blocks lead to fewer cache invalidations. So the access pattern in the example above is suboptimal, and it is better done like this.

std::vector<int> numbers(12);  // 12 elements initialized with 0

// thread A          // thread B          // thread C

numbers[ 0] =  2;    numbers[ 4] = 11;    numbers[ 8] = 23;
numbers[ 1] =  3;    numbers[ 5] = 13;    numbers[ 9] = 29;
numbers[ 2] =  5;    numbers[ 6] = 17;    numbers[10] = 31;
numbers[ 3] =  7;    numbers[ 7] = 19;    numbers[11] = 37;

      
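If you want to see this block-partitioned scheme in action, here is a minimal runnable sketch with std::thread; compute is a hypothetical stand-in for whatever expensive per-element work produces the values:

#include <thread>
#include <vector>

int compute(int i) { return i * i; }  // hypothetical expensive computation

int main() {
    const int n = 12, p = 3;
    std::vector<int> numbers(n);  // pre-sized, so no reallocation and no critical section

    std::vector<std::thread> threads;
    for (int t = 0; t < p; ++t)
        threads.emplace_back([&, t] {  // capture t by value so each thread keeps its own id
            // thread t owns the half-open index range [t * n / p, (t + 1) * n / p)
            for (int j = t * n / p; j < (t + 1) * n / p; ++j)
                numbers[j] = compute(j);
        });
    for (auto& t : threads) t.join();
}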

So far so good. But what if you don't have a std::vector<int> but a std::vector<Foo>? If Foo has no default constructor, then

std::vector<Foo> numbers(10);

will be invalid. And even if it has one, it would be wasteful to create many expensive default-constructed objects only to reassign them soon afterwards without ever using their values.

Of course, most well-designed classes have a very cheap default constructor. For example, std::string default-constructs to an empty string, which requires no memory allocation. A good implementation will reduce the cost of default-constructing a string to something like

std::memset(this, 0, sizeof(std::string));

And if the compiler is smart enough to figure out that we are allocating and initializing the whole std::vector<std::string>(n), it can optimize this further down to a single call to

std::calloc(n, sizeof(std::string));

So, if there is any chance that you can make Foo cheaply default-constructible and assignable, you are done. If that proves difficult, you can avoid the problem by moving the data to the heap. A smart pointer is cheap to default-construct, so

std::vector<std::unique_ptr<Foo>> foos(n);

ultimately boils down to

std::calloc(n, sizeof(std::unique_ptr<Foo>));

without running a single constructor of Foo. Of course, this convenience comes at the cost of a dynamic memory allocation for each element.

std::vector<std::unique_ptr<Foo>> foos(n);

// thread A                    // thread B                           // thread C

foos[0].reset(new Foo {...});  foos[n / 3 + 0].reset(new Foo {...});  foos[2 * n / 3 + 0].reset(new Foo {...});
foos[1].reset(new Foo {...});  foos[n / 3 + 1].reset(new Foo {...});  foos[2 * n / 3 + 1].reset(new Foo {...});
foos[2].reset(new Foo {...});  foos[n / 3 + 2].reset(new Foo {...});  foos[2 * n / 3 + 2].reset(new Foo {...});
...                            ...                                    ...

This may not be as bad as it sounds: while dynamic memory allocations are not free, the sizeof a std::unique_ptr is very small, so if sizeof(Foo) is large, you get the bonus of a more compact vector that is faster to iterate over. It all depends on how you intend to use your data.

If you don't know the exact number of elements in advance, or if you are afraid you will mess up the indexing, there is another way to do it: have each thread fill its own vector and merge them at the end. Continuing with the primes example, we get this.

std::vector<int> numbersA {};  // private store for thread A
std::vector<int> numbersB {};  // private store for thread B
std::vector<int> numbersC {};  // private store for thread C

// thread A              // thread B              // thread C

numbersA.push_back( 2);  numbersB.push_back(11);  numbersC.push_back(23);
numbersA.push_back( 3);  numbersB.push_back(13);  numbersC.push_back(29);
numbersA.push_back( 5);  numbersB.push_back(17);  numbersC.push_back(31);
numbersA.push_back( 7);  numbersB.push_back(19);  numbersC.push_back(37);

// Back on the main thread after A, B and C are joined:

std::vector<int> numbers(
    numbersA.size() + numbersB.size() + numbersC.size());
auto pos = numbers.begin();
pos = std::move(numbersA.begin(), numbersA.end(), pos);
pos = std::move(numbersB.begin(), numbersB.end(), pos);
pos = std::move(numbersC.begin(), numbersC.end(), pos);
assert(pos == numbers.end());

// Now dispose of numbersA, numbersB and numbersC as soon as possible
// in order to release their no longer needed memory.

(The std::move used in the code above is the three-argument overload from the algorithms library, not the single-argument one from <utility>.)

This approach has the most desirable memory access pattern of all, because numbersA, numbersB and numbersC write to completely independently allocated memory. Of course, we have to do additional sequential work to merge the intermediate results. Note that the efficiency relies on the assumption that the cost of moving an element is negligible compared to the cost of finding or creating it. As written above, the code also assumes that your type has a cheap default constructor. If that is not the case for your type, you can again use smart pointers.
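
If the cheap-default-constructor requirement bothers you, here is a minimal sketch of an alternative merge step that sidesteps it by using insert with std::make_move_iterator instead of pre-sizing the destination (for int the moves are just copies, but the pattern carries over to move-only types):

#include <iterator>
#include <vector>

// assumes numbersA, numbersB and numbersC have been filled as above
std::vector<int> numbers;
numbers.reserve(numbersA.size() + numbersB.size() + numbersC.size());
for (auto* src : {&numbersA, &numbersB, &numbersC}) {
    numbers.insert(numbers.end(),
                   std::make_move_iterator(src->begin()),
                   std::make_move_iterator(src->end()));
    src->clear();          // drop the moved-from contents ...
    src->shrink_to_fit();  // ... and release the no longer needed memory
}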

Hopefully this has provided you with enough ideas to optimize your problem.

If you have never used smart pointers before, take a look at "RAII and Smart Pointers in C++" and check out the standard library's dynamic memory management facilities. Of course, the techniques shown above would also work with a std::vector<Foo *>, but we don't use resource-owning raw pointers like that in modern C++.



The problem is that your constructor does a lot of work, and this breaks all sorts of library conventions for constructing elements and inserting them into containers.

Just fix this by decoupling the insert from the creation.

The code below is very similar to the code suggested by @5gon12eder, except that it does not "force" you to move the objects.

In my little demo:

  • we use a raw memory area that is completely uninitialized (this is not possible with a vector, where insertion implies initialization), so instead of the "canonical"

    std::array<RKD, 500> rkd_buffer; // OR
    std::vector<RKD> rkd_buffer(500); // OR even
    std::unique_ptr<RKD[]> rkd_buffer(new RKD[500]);

    we're going to use a custom combination:

    std::unique_ptr<RKD[N], decltype(&::free)> rkd_buffer(
        static_cast<RKD(*)[N]>(::malloc(sizeof(RKD) * N)),
        ::free
    );

  • we create multiple threads (5 in the sample) to construct all the elements. The elements are constructed in place, and their corresponding destructors are invoked explicitly at the end of the program; therefore
  • it is important that all elements are fully initialized before rkd_buffer goes out of scope (the join calls ensure this here).
  • the threads can be synchronized in various ways: construction tasks could, for example, be dispatched through a work queue to a thread pool, where condition variables, promises, thread barriers (from Boost), or even just atomic counters could be used for coordination. All of these choices are essentially unrelated to the task of constructing in parallel, so I leave them to your imagination (or other SO answers); a minimal atomic-counter sketch appears after the timing results below.



Live On Coliru

#include <chrono>
#include <cstdlib>
#include <memory>
#include <new>
#include <thread>
#include <vector>

struct RKD {
    RKD() { std::this_thread::sleep_for(std::chrono::milliseconds(rand() % 100)); } // expensive
};

int main() {
    static const int N         = 500;
    static const int ChunkSize = 100;
    std::unique_ptr<RKD[N], decltype(&::free)> rkd_buffer(static_cast<RKD(*)[N]>(::malloc(sizeof(RKD) * N)), ::free);

    std::vector<std::thread> group;
    for (int chunk = 0; chunk < N / ChunkSize; ++chunk)
        group.emplace_back([&, chunk] { // capture chunk by value so each thread gets its own chunk index
            for (int i = chunk * ChunkSize; i < (chunk + 1) * ChunkSize; ++i)
                new (&(*rkd_buffer)[i]) RKD; // placement-new: construct element i in the raw buffer
        });

    for (auto& t : group) if (t.joinable()) t.join();

    // we are responsible for destruction, since we also took responsibility for construction
    for (RKD& v : *rkd_buffer)
        v.~RKD();
}

You can see that 5 threads share the 500 constructions. Each construction takes ~50 ms on average, so each thread performs 100 of them sequentially and the total time should be about 100 * 50 ms ≈ 5 s. And indeed, that is exactly what happens:

real    0m5.193s
user    0m0.004s
sys     0m0.000s
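
As mentioned above, the coordination can also be done with an atomic counter, so that threads claim indices dynamically instead of working on fixed chunks. A minimal sketch of that variant, with a trivial computation standing in for the expensive constructor:

#include <atomic>
#include <thread>
#include <vector>

int main() {
    constexpr int N = 500;
    std::vector<int> results(N);  // pre-sized, so no reallocation and no race on the buffer
    std::atomic<int> next{0};     // hands out indices to whichever thread is free

    auto worker = [&] {
        // each thread claims the next unclaimed index until none are left
        for (int i = next.fetch_add(1); i < N; i = next.fetch_add(1))
            results[i] = i * i;   // stand-in for the expensive construction
    };

    std::vector<std::thread> group;
    for (int t = 0; t < 5; ++t) group.emplace_back(worker);
    for (auto& t : group) t.join();
}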

      
