Best use of Parallel.ForEach / multithreading
I need to scrape data from a website. I have over 1000 links that I need to access. Previously I assigned 10 links to each thread and started 100 threads, each pulling its 10. After a few test runs, 100 threads turned out to be the best count for minimizing the time it took to get the content of all links.
I realized that .NET 4.0 offers better multithreading support out of the box, but this is done based on how many cores you have, which in my case doesn't spawn enough threads. I guess what I am asking is: what is the best way to optimize pulling the 1000 links? Should I use Parallel.ForEach and let the Parallel extension control the number of threads that are spawned, or find a way to tell it how many threads to start and split the work myself?
I haven't worked with Parallel before, so maybe my approach is wrong.
Something worth checking out is the TPL dataflow library.
DataFlow on MSDN.
See Nesting await in Parallel.ForEach:
The whole idea of Parallel.ForEach() is that you have a set of threads and each thread processes a part of the collection. As you noticed, this does not work with async-await, where you want to release the thread for the duration of the asynchronous call.
In addition, the Creating a Dataflow Pipeline walkthrough sets up and processes web page downloads. TPL Dataflow really was designed for this scenario.
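To make the Dataflow suggestion a bit more concrete, here is a minimal sketch of how the 1000 links might be pushed through a single ActionBlock with a fixed concurrency limit. The class name DataflowScraper, the method FetchAllAsync and the fetchAsync delegate are made up for illustration; the delegate stands in for whatever download call you actually use. On older frameworks the library ships as the System.Threading.Tasks.Dataflow NuGet package.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DataflowScraper
{
    // fetchAsync is a hypothetical stand-in for the real download call,
    // for example url => httpClient.GetStringAsync(url).
    public static async Task<IDictionary<string, string>> FetchAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetchAsync,
        int maxConcurrentRequests)
    {
        var pages = new ConcurrentDictionary<string, string>();

        // Upper bound on how many downloads are in flight at once.
        var options = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = maxConcurrentRequests
        };

        // The async delegate releases its thread while the request is pending.
        var downloader = new ActionBlock<string>(async url =>
        {
            pages[url] = await fetchAsync(url);
        }, options);

        foreach (var url in urls)
        {
            downloader.Post(url);
        }

        // Signal that no more links are coming, then wait for the queue to drain.
        // If any download throws, the block faults and this await rethrows.
        downloader.Complete();
        await downloader.Completion;

        return pages;
    }
}
```

With an HttpClient instance called httpClient and a list called links (both assumed here), a call could look like: await DataflowScraper.FetchAllAsync(links, url => httpClient.GetStringAsync(url), 100). The limit of 100 is just the figure from the question, not a recommendation.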
You can pass a ParallelOptions object with its MaxDegreeOfParallelism property set to Parallel.ForEach to cap the number of threads that will be used.
Here is a piece of code:
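// Constants.RootFolder and MyMethod are not defined in this snippet.
ParallelOptions opt = new ParallelOptions();
opt.MaxDegreeOfParallelism = 5; // run at most 5 iterations concurrently
Parallel.ForEach(Directory.GetDirectories(Constants.RootFolder), opt, MyMethod);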
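Applied to the scraping scenario from the question, the same idea could look roughly like the sketch below. ThrottledParallelFetch, FetchAll and the fetch delegate are hypothetical names; fetch stands in for a blocking download call. Keep in mind that MaxDegreeOfParallelism is only an upper bound, so the thread pool may take a while to ramp up to it.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ThrottledParallelFetch
{
    // fetch is a hypothetical blocking download, e.g. a WebClient.DownloadString wrapper.
    public static IDictionary<string, string> FetchAll(
        IEnumerable<string> urls,
        Func<string, string> fetch,
        int maxDegreeOfParallelism)
    {
        // Thread-safe store, since iterations run on several threads.
        var pages = new ConcurrentDictionary<string, string>();

        var options = new ParallelOptions
        {
            // Caps concurrent iterations; it does not force that many threads to start.
            MaxDegreeOfParallelism = maxDegreeOfParallelism
        };

        Parallel.ForEach(urls, options, url =>
        {
            pages[url] = fetch(url);
        });

        return pages;
    }
}
```

Passing 100 as the last argument would mirror the thread count from the question, but whether the pool actually reaches that number depends on how quickly it injects new threads.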
In general, Parallel.ForEach() optimizes the number of threads quite well. It takes into account the number of cores on the system, but it also takes into account what the threads are doing (CPU bound, IO bound, method duration, etc.).
You can control the maximum degree of parallelism, but there is no mechanism to increase the number of threads.
Make sure your benchmarks are correct and can be compared fairly (for example, run them against the same websites, allow a warm-up period before you start measuring, and do many runs, since the variance in response times can be quite high when scraping sites). If, after careful measurement, your own threading code is still faster, you can conclude that you optimized for your particular case better than .NET did and stick with your own code.
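A small harness along these lines is one way to do the warm-up and repeated runs described above. ScrapeBenchmark, Measure and Median are illustrative names, and candidate is whichever scraping routine you want to time; the median is used here only because it is less sensitive to a single slow response than the mean.

```csharp
using System;
using System.Diagnostics;

public static class ScrapeBenchmark
{
    // Runs candidate unmeasured warmupRuns times, then times it measuredRuns times.
    // Returns the timings in milliseconds, sorted ascending.
    public static double[] Measure(Action candidate, int warmupRuns, int measuredRuns)
    {
        for (int i = 0; i < warmupRuns; i++)
        {
            candidate(); // warm-up: JIT, connection setup, caches
        }

        var timings = new double[measuredRuns];
        for (int i = 0; i < measuredRuns; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            candidate();
            stopwatch.Stop();
            timings[i] = stopwatch.Elapsed.TotalMilliseconds;
        }

        Array.Sort(timings);
        return timings;
    }

    // Median of an already sorted array of timings.
    public static double Median(double[] sortedTimings)
    {
        int n = sortedTimings.Length;
        if (n == 0) throw new ArgumentException("No timings to summarize.");
        return n % 2 == 1
            ? sortedTimings[n / 2]
            : (sortedTimings[n / 2 - 1] + sortedTimings[n / 2]) / 2.0;
    }
}
```

You would run Measure once for the hand-rolled 100-thread version and once for the Parallel.ForEach version against the same list of links, then compare the medians and the spread between the fastest and slowest runs.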