Do I need to worry about the number of tasks I create?
I tried looking for something on the internet regarding this, but there doesn't seem to be a definite answer. I just have my own reasoning and would like to know which is the best way.
My application runs through a long list of files (about 100-200) and does some calculations on internal data. It takes a few minutes for each file.
I originally planned to create tasks based on the number of cores in the processor.
So, if there are 4 cores, I would create 3 tasks and each of them would process 1/3 of the files.
From what I have read, the thread pool manages tasks and decides how many threads to create for them based on many factors (in simple terms?).
Would it be better for me to just create a task for each file and let the thread pool decide which is best?
Any information or suggestions would be very welcome. Thanks!
EDIT: All files are around 5MB and calculating / analyzing the data in the files is very heavy.
200 files isn't such a long list, but I still recommend not flooding the ThreadPool with pending tasks.
For this you can use an ActionBlock from TPL Dataflow. You create a block, give it an action to run for each item, and limit the parallelism to whatever you want.
Example in C#:
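// At most 50 files are in flight at once; the rest wait in the block's input queue.
var block = new ActionBlock<string>(async fileName =>
{
    var data = await ReadFileAsync(fileName);
    ProcessData(data);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });

foreach (var fileName in fileNames)
{
    block.Post(fileName);
}

// Signal that no more items will be posted, then wait for the queued ones to finish.
block.Complete();
await block.Completion;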
Since this is not purely a CPU-bound operation, you should use a degree of parallelism higher than the number of available cores. Consider putting the value in a config file so that you can tune it based on actual performance.
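As a rough sketch of the config idea, the limit could live in a small text file that holds a single integer. The file format, the class name ParallelismSettings, and the method name ReadMaxParallelism are all assumptions made for illustration; a real application would more likely use whatever configuration mechanism it already has.

```csharp
using System;
using System.IO;

public static class ParallelismSettings
{
    // Hypothetical helper: reads a positive integer from a text file.
    // Returns the fallback when the file is missing or does not contain
    // a valid positive number, so a bad setting never stops the run.
    public static int ReadMaxParallelism(string path, int fallback)
    {
        if (!File.Exists(path))
        {
            return fallback;
        }

        string text = File.ReadAllText(path).Trim();
        return int.TryParse(text, out int value) && value > 0 ? value : fallback;
    }
}
```

The returned value would then be passed as MaxDegreeOfParallelism instead of a hard-coded number.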
"based on many factors"
This is the key point. It is unpredictable (to me) how many threads will actually be started when the load is not CPU-bound. The .NET thread pool heuristics are highly volatile (subjective: insane) and cannot be relied upon.
"let the thread pool decide which is best"
It does not know. It is (mostly) good at scheduling CPU-bound work, but it cannot find the optimal degree of parallelism for IO-bound work.
Use PLINQ:
myFiles
    .AsParallel().WithDegreeOfParallelism(optimalDopHere)
    .ForAll(x => Process(x));
Determine the optimal degree of parallelism empirically.
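One way to do that measurement, as a minimal sketch: time the same sample of work at several candidate values and compare. The Process method here is a hypothetical stand-in (a short sleep to simulate IO plus a small CPU loop), and the file names and candidate values are made up; substitute the real per-file work and a representative subset of the real files.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

public static class DopBenchmark
{
    // Hypothetical stand-in for the real per-file work: a short blocking
    // wait (simulated IO) followed by a small amount of CPU work.
    public static void Process(string fileName)
    {
        Thread.Sleep(20);
        double sum = 0;
        for (int i = 0; i < 100_000; i++)
        {
            sum += Math.Sqrt(i);
        }
    }

    // Runs the whole sample once at the given degree of parallelism
    // and returns the elapsed wall-clock time.
    public static TimeSpan Measure(string[] files, int dop)
    {
        var watch = Stopwatch.StartNew();
        files.AsParallel()
             .WithDegreeOfParallelism(dop)
             .ForAll(Process);
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        // Made-up sample; use a representative subset of the real files.
        string[] sample = Enumerable.Range(0, 32).Select(i => $"file{i}.dat").ToArray();

        foreach (int dop in new[] { 1, 2, 4, 8, 16 })
        {
            TimeSpan elapsed = Measure(sample, dop);
            Console.WriteLine($"DOP {dop}: {elapsed.TotalMilliseconds:F0} ms");
        }
    }
}
```

Timings from a single run are noisy, so it is worth repeating each measurement a few times on the target machine before settling on a value.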
If the work were purely CPU-bound, you could get away with almost any parallel construct, such as Parallel or PLINQ.
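For that purely CPU-bound case, a sketch using Parallel.ForEach capped at the number of cores might look like this. Analyze is a hypothetical placeholder for the heavy calculation, and the class and method names are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class CpuBoundExample
{
    // Hypothetical placeholder for the heavy per-file calculation.
    public static long Analyze(string fileName)
    {
        long hash = 17;
        foreach (char c in fileName)
        {
            hash = hash * 31 + c;
        }
        return hash;
    }

    // Processes every file with at most one worker per core and
    // returns how many files were handled.
    public static int ProcessAll(IEnumerable<string> fileNames)
    {
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        int processed = 0;
        Parallel.ForEach(fileNames, options, fileName =>
        {
            Analyze(fileName);
            // The counter is shared between workers, so increment it atomically.
            Interlocked.Increment(ref processed);
        });
        return processed;
    }
}
```

This gives one work item per file rather than pre-splitting the list into a fixed number of chunks, so a worker that finishes early simply picks up the next file.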