Concurrency::parallel_for (PPL) creates too many threads

I am using Concurrency::parallel_for() from the Visual Studio 2010 Parallel Patterns Library (PPL) to process an indexed set of tasks (the set of indices is usually much larger than the number of threads that can run concurrently). Each task, before doing a lengthy calculation, starts by requesting a private working-storage resource from a shared resource manager (in my case a file view tied to the specific task, but I think the story would be the same if each task requested a private memory allocation from a shared heap).

Access to the shared resource manager is synchronized with a Concurrency::critical_section, and this is where the problem arises: if the first thread/task is inside the critical section and a second task issues a request, it has to wait until the first task is done with its request. The PPL apparently reasons: "this thread is blocked and there are more tasks pending, so let's spawn another thread", and it ends up creating as many as 870 threads, most of which are waiting on the same resource manager.
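
To make the setup concrete, here is a minimal sketch of the pattern I am describing; Resource, ResourceManager, acquire and release are placeholder names standing in for my actual code:

    #include <ppl.h>
    #include <concrt.h>

    // Placeholder stand-ins for the real working-storage resource and its manager.
    struct Resource { /* e.g. a file view mapped for one task */ };

    struct ResourceManager {
        Concurrency::critical_section lock;

        Resource acquire() {
            // Cooperative lock: when a task blocks here, the PPL scheduler is
            // free to inject additional threads to keep the cores busy.
            Concurrency::critical_section::scoped_lock guard(lock);
            return Resource(); // bookkeeping for the request happens here
        }
        void release(Resource&) { /* return the resource to the manager */ }
    };

    void processTasks(ResourceManager& manager, int taskCount) {
        Concurrency::parallel_for(0, taskCount, [&](int i) {
            Resource r = manager.acquire();   // short, serialized part
            // ... lengthy calculation using r ...
            manager.release(r);
        });
    }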

Handling the resource request is only a small part of the whole task, so I would like to tell the PPL to hold its horses in that part: no pending task and no cooperative blocking should trigger new threads from that section of the workflow. My question is whether, and how, I can prevent a particular code section from spawning new threads, even if it blocks cooperatively. I would not mind new threads being created at other blocking points further down the processing path, but no more than 2 * the number of (hyper-threaded) cores.

The alternatives I've looked at so far:

  • A task queue, with the queue drained by a limited number of threads. Problem: I was hoping parallel_for would do this by itself.

  • Define a Concurrency::combinable<Resource> resourceSet outside the Concurrency::parallel_for and initialize resourceSet.local() once, to reduce the number of resource requests (by reusing resources) to the number of threads, which should be less than the number of tasks; see the sketch after this list. Problem: this optimization does not prevent the extra threads from being created.

  • Preallocate the necessary resources for each task outside the parallel_for loop. Problem: this would tie up too many system resources, whereas limiting the resources to the number of threads/cores would be fine (if the thread count didn't blow up).
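
To make the combinable alternative concrete, here is a minimal sketch of what I tried; Resource and acquireFromManager() are placeholders for my actual manager calls:

    #include <ppl.h>

    // Placeholder types; acquireFromManager() stands for the real, locked request.
    struct Resource { /* ... */ };
    Resource acquireFromManager();

    void processWithCombinable(int taskCount) {
        // Lazily create one Resource per worker thread instead of one per task.
        Concurrency::combinable<Resource> resourceSet([] { return acquireFromManager(); });

        Concurrency::parallel_for(0, taskCount, [&](int i) {
            Resource& r = resourceSet.local(); // only the first call on a thread hits the manager
            // ... lengthy calculation using r ...
        });
        // The per-thread resources could be released afterwards via combine_each().
    }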

I have read http://msdn.microsoft.com/en-us/library/ff601930.aspx , section "Do Not Block Repeatedly in a Parallel Loop", but if I followed that advice to the letter there would be no parallelism left at all.

2 answers


I don't know whether PPL / ConcRT can be configured not to use cooperative blocking, or at least to put a cap on the number of threads it creates. I thought it could be controlled with the scheduler policy, but it seems that none of the policy settings serve this purpose.
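
For reference, this is the kind of policy tuning I mean; it is only a sketch, and MaxConcurrency caps the number of virtual processors rather than the number of threads, so the scheduler can still add threads when tasks block cooperatively:

    #include <ppl.h>
    #include <concrt.h>
    #include <concrtrm.h>

    using namespace Concurrency;

    int main() {
        // Cap the scheduler at one virtual processor per core. This limits how
        // much work runs in parallel, but cooperative blocking inside tasks can
        // still make the scheduler create extra threads on top of it.
        SchedulerPolicy policy(2,
                               MinConcurrency, 1,
                               MaxConcurrency, GetProcessorCount());
        CurrentScheduler::Create(policy);

        parallel_for(0, 100, [](int) { /* ... task body ... */ });

        CurrentScheduler::Detach();
    }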

However, I have a couple of suggestions that you might find helpful to mitigate the problem, even if they are not an ideal solution:



  • Instead of critical_section, use a non-cooperative synchronization primitive to protect the resource manager. I think (though I have not tested it) that a classic WinAPI CRITICAL_SECTION should do fine. As a more radical step in this direction, you might consider other parallel libraries for your code; for example, Intel TBB provides most of the PPL API and offers more besides (disclaimer: I am affiliated with it).

  • Preallocate a number of resources outside the parallel loop. One resource per task is not required; one per thread should be sufficient. Put these resources into a concurrent_queue, and inside a task pop a resource from the queue, use it, and then push it back (see the sketch after this list). Instead of returning a resource to the queue, a thread could also keep it in a combinable object for reuse in its other tasks. If the queue turns out to be empty (for example, if the PPL oversubscribes the machine), different approaches are possible, e.g. spinning in a loop until another thread returns a resource, or requesting one more resource from the manager. You can also preallocate more resources than the number of threads to minimize the chance of running out.
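
A rough sketch of the second suggestion; Resource and createResource() are placeholders, and the spin-with-yield fallback is just one possible reaction to an empty queue:

    #include <ppl.h>
    #include <concrt.h>
    #include <concrtrm.h>
    #include <concurrent_queue.h>

    // Placeholder resource type and factory; substitute the real manager calls.
    struct Resource { /* ... */ };
    Resource* createResource();

    void processWithPool(int taskCount) {
        Concurrency::concurrent_queue<Resource*> pool;

        // Preallocate roughly one resource per hardware thread; a few extra
        // reduce the chance of finding the queue empty.
        const unsigned poolSize = 2 * Concurrency::GetProcessorCount();
        for (unsigned i = 0; i < poolSize; ++i)
            pool.push(createResource());

        Concurrency::parallel_for(0, taskCount, [&](int) {
            Resource* r = nullptr;
            while (!pool.try_pop(r))
                Concurrency::Context::Yield();   // or request one more from the manager

            // ... lengthy calculation using *r ...

            pool.push(r);                        // return the resource for reuse
        });

        // Clean-up of the pooled resources is omitted for brevity.
    }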



My answer is not a "solution" using PPL, but I think you can do this quite easily with a thread pool like taskqueue; you should take a look at this answer.



This way you fill the queue with your jobs, which ensures that no more than x tasks run in parallel, where x is boost::thread::hardware_concurrency() (yes, it's Boost again ...).
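
A minimal sketch of that idea, assuming Boost.Asio as the task queue and Boost.Thread for the workers; the job body is a placeholder:

    #include <boost/asio/io_service.hpp>
    #include <boost/thread.hpp>
    #include <boost/bind.hpp>

    int main() {
        boost::asio::io_service queue;

        // Post all the jobs first; io_service::run() returns once the queue is empty.
        for (int task = 0; task < 1000; ++task)
            queue.post([task] { /* acquire a resource, do the lengthy calculation */ });

        // Drain the queue with exactly hardware_concurrency() worker threads,
        // so no more than that many jobs ever run in parallel.
        boost::thread_group workers;
        for (unsigned i = 0; i < boost::thread::hardware_concurrency(); ++i)
            workers.create_thread(boost::bind(&boost::asio::io_service::run, &queue));

        workers.join_all();
        return 0;
    }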







