How to design threads for many short tasks

I want to use multiple threads to speed up my program, but I am not sure which approach is best.

Let's say we have 10,000 small tasks, each of which takes only 0.1 s to complete. I have a processor with 12 cores and I want to use 12 threads to make the whole thing faster.

As far as I know there are two ways:

1. Task pool

There are always 12 threads; each one takes a new task from the task pool as soon as it finishes its current one.

2. Divide the tasks up front

Divide the 10,000 tasks into 12 parts and let each thread work through its own part.

The problem is that with the task pool, time is wasted on locking and unlocking whenever several threads try to access the pool at once. The second way is not ideal either: some threads finish earlier than others, so the total time is determined by the slowest thread.
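For concreteness, this is roughly what I mean by the two options. It's only a sketch: do_task, run_pooled and run_partitioned are placeholder names, and the real tasks are of course more than a single function call.

    #include <cstddef>
    #include <mutex>
    #include <thread>
    #include <vector>

    void do_task(std::size_t i);   // placeholder for one ~0.1 s unit of work

    // Option 1: task pool - threads lock a shared index to grab the next task.
    void run_pooled(std::size_t task_count, unsigned thread_count) {
        std::mutex m;
        std::size_t next = 0;
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < thread_count; ++t)
            workers.emplace_back([&] {
                for (;;) {
                    std::size_t i;
                    {
                        std::lock_guard<std::mutex> lock(m);  // the lock/unlock I'm worried about
                        if (next == task_count) return;
                        i = next++;
                    }
                    do_task(i);
                }
            });
        for (auto& w : workers) w.join();
    }

    // Option 2: divide up front - thread t gets a fixed contiguous slice.
    void run_partitioned(std::size_t task_count, unsigned thread_count) {
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < thread_count; ++t)
            workers.emplace_back([=] {
                std::size_t begin = task_count * t / thread_count;
                std::size_t end   = task_count * (t + 1) / thread_count;
                for (std::size_t i = begin; i < end; ++i)
                    do_task(i);
            });
        for (auto& w : workers) w.join();
    }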

I am wondering how you handle this kind of workload, and whether there is a better way to do it. Thanks.

EDIT: Note that 10,000 is just an example; in practice the number of tasks can range from 1 to 10^8 or more, and 0.1 s is only the average time per task.

EDIT2: Thanks for all your answers :] It's good to know what kinds of options there are.

+3




5 answers


Both methods suggested in the question will work, and roughly equally well, in simple cases with predictable and relatively long task durations. If the target system is known and available (and if performance really is a major concern), the approach should be chosen based on prototyping and measurement.

Don't assume up front that the optimal number of threads must match the number of cores. On a regular server or desktop system, various system processes will come and go, and you may see your 12 threads migrating between processors, which hurts memory caching.

There are also important factors beyond raw measurements that you should check: Do these small tasks need any resources to complete? Do those resources introduce additional potential latency (blocking) or contention? Are there other applications competing for CPU time? Will the application have to grow to accommodate different runtimes, task types, or user interaction models?



If the answers to these are no, here are some additional approaches you can measure and consider.

  • Use only 10 or 11 threads. You will see either a slight slowdown or possibly even a slight speedup (the spare core will serve OS processes, so the affinity of your remaining threads becomes more stable than with 12 threads). Any concurrent interactive activity on the system will see a big improvement in responsiveness.

  • Create exactly 12 threads, but explicitly set a different processor affinity mask on each one to enforce a 1-to-1 mapping between threads and processors (see the sketch after this list). This is good in the simplest, almost academic case where no resources other than CPU and shared memory are involved; you won't see chronic migration of threads across processors. The disadvantage is an algorithm that is tightly tied to a specific machine; on another machine it might behave so badly that it never finishes (because of an unrelated real-time task that blocks one of your threads forever).

  • Create 12 threads and divide the tasks evenly. Have each thread lower its own priority once it has worked through 40% of its load, and again once it has passed 80%. This will improve load balancing inside your process, but it will behave badly if your application is competing with other CPU-bound processes.
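A minimal sketch of the affinity idea from the second bullet, assuming Linux with glibc (pthread_setaffinity_np); on Windows, SetThreadAffinityMask plays the same role. The pin_to_core helper and the empty worker body are placeholders of mine, not part of the answer.

    // Linux/glibc only; compile with: g++ -std=c++17 -pthread affinity.cpp
    #include <pthread.h>   // pthread_setaffinity_np (GNU extension)
    #include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
    #include <thread>
    #include <vector>

    // Pin a worker to one specific core so the scheduler never migrates it.
    void pin_to_core(std::thread& t, unsigned core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
    }

    int main() {
        unsigned cores = std::thread::hardware_concurrency();
        std::vector<std::thread> workers;
        for (unsigned c = 0; c < cores; ++c) {
            workers.emplace_back([c] { (void)c; /* work through slice c of the tasks */ });
            pin_to_core(workers.back(), c);   // set right after creation; the thread may migrate at most once
        }
        for (auto& w : workers) w.join();
    }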

+2




As a middle ground between the two approaches, split the 10,000 tasks into 100 batches of 100 tasks each, and let each core pick up one batch of 100 tasks at a time from the task pool.

Perhaps if you model the runtime variance of a single task on a single core and estimate the mutex locking time, you can find the optimal batch size.



But without much work we at least have the following lemma:

The slowest thread can take at most 100 × 0.1 s = 10 s longer than the others.
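A minimal sketch of this batched pool, assuming the tasks can be addressed by index; do_task, run_batched, and the batch parameter are illustrative names rather than anything prescribed by the answer.

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    void do_task(std::size_t i);   // placeholder for one ~0.1 s task

    // Each thread grabs a whole batch per visit to the shared counter,
    // so the pool is touched ~100 times instead of 10,000.
    void run_batched(std::size_t task_count, std::size_t batch, unsigned threads) {
        std::atomic<std::size_t> next{0};
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < threads; ++t)
            workers.emplace_back([&] {
                for (;;) {
                    std::size_t begin = next.fetch_add(batch);
                    if (begin >= task_count) return;
                    std::size_t end = std::min(begin + batch, task_count);
                    for (std::size_t i = begin; i < end; ++i)
                        do_task(i);
                }
            });
        for (auto& w : workers) w.join();
    }

    // e.g. run_batched(10'000, 100, 12);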

+4




The task pool is always the best solution here, not only for timing but also for code comprehensibility. You should never force your tasks to satisfy the completely unrelated requirement of being split into exactly as many chunks as there are cores - your tasks have nothing to do with that at all, and such a split does not scale when the machine changes, etc. It also adds the overhead of coordinating to combine the subtask results into the final result; keeping the tasks independent of the core count usually makes the job simpler.

But you shouldn't worry about using locks for the task pool yet. Non-blocking queues are available if you ever determine that locking really is a problem - but figure that out first. If you're worried about time, use appropriate methods to speed up your tasks and put the effort where it gives the most benefit. Profile your code. Why do your tasks take 0.1 seconds? Are they using an inefficient algorithm? Can the amount of work be reduced? If you discover hotspots in your code through profiling, you may find that locks are the least of your worries. And if you find that everything already runs as fast as possible and you still want to shave off that extra second by removing locks, search the web with your favorite search engine for "lock-free queue" and "wait-free queue". Compare-and-swap makes atomic lists easy.
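As a taste of what compare-and-swap gives you, here is a minimal sketch assuming all tasks are known up front and addressed by index; a real lock-free or wait-free queue for dynamically produced tasks is considerably more involved. The TaskCursor name is illustrative.

    #include <atomic>
    #include <cstddef>

    // Claiming the next task is a single compare-and-swap on a shared cursor:
    // no mutex and no blocking.
    struct TaskCursor {
        std::atomic<std::size_t> next{0};
        std::size_t count;

        explicit TaskCursor(std::size_t n) : count(n) {}

        // Returns false once the pool is exhausted.
        bool claim(std::size_t& index) {
            std::size_t cur = next.load();
            while (cur < count) {
                if (next.compare_exchange_weak(cur, cur + 1)) {
                    index = cur;
                    return true;
                }
                // On failure, cur was reloaded with the current value; retry.
            }
            return false;
        }
    };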

+2




At 100 ms per task, just queue them up as they are - the pool overhead will be negligible.

OTOH ..

1E8 tasks @ 0.1s / task = 10,000,000 seconds = 2777.7 hours = 115.7 days

This is much longer than the interval between one Patch-Tuesday reboot and the next.

Even if you run this on Linux, you should checkpoint and flush progress to disk in such a way that the job can be restarted.

Is there a database? If so, you should have told us!

+1




Each worker thread can have its own small task queue occupying no more than one or two pages of memory. When the queue runs low (down to half capacity), the worker signals a manager thread to refill it with more tasks. If the queue is organized in batches, worker threads never need to enter a critical section unless their current batch is empty. By avoiding critical sections you gain extra cycles for the actual work. Two batches per queue are enough; in that case one batch fits in one memory page, so the queue occupies two.

The point of sizing the queue in memory pages is that a thread doesn't have to jump all over memory to fetch its data. If all the data is in one place (one page of memory), you avoid cache misses.
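A rough sketch of this idea, with one simplification: instead of a manager thread pushing refills on a signal, the worker pulls its own refill when its local batch runs out, so it still touches the shared lock only once per batch. All names (TaskSource, LocalQueue, kBatch) are illustrative.

    #include <algorithm>
    #include <array>
    #include <cstddef>
    #include <mutex>

    // Shared task source guarded by one lock; workers only touch it
    // when their local batch is empty.
    struct TaskSource {
        std::mutex m;
        std::size_t next = 0;
        std::size_t count;

        explicit TaskSource(std::size_t n) : count(n) {}

        // Fill `out` with up to `capacity` task indices; returns how many were taken.
        std::size_t refill(std::size_t* out, std::size_t capacity) {
            std::lock_guard<std::mutex> lock(m);       // the only critical section
            std::size_t n = std::min(capacity, count - next);
            for (std::size_t i = 0; i < n; ++i) out[i] = next + i;
            next += n;
            return n;
        }
    };

    // Per-worker local queue: one page-sized batch of pending task indices
    // (512 * sizeof(std::size_t) = 4 KiB on a typical 64-bit system).
    struct LocalQueue {
        static constexpr std::size_t kBatch = 512;
        std::array<std::size_t, kBatch> tasks;
        std::size_t size = 0, pos = 0;

        bool pop(TaskSource& src, std::size_t& index) {
            if (pos == size) {                         // batch exhausted: refill from the pool
                size = src.refill(tasks.data(), kBatch);
                pos = 0;
                if (size == 0) return false;           // no work left anywhere
            }
            index = tasks[pos++];
            return true;
        }
    };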

+1








