Do I need to worry about the number of tasks I create?

I tried looking for something on the internet regarding this, but there doesn't seem to be a definite answer. I just have my own reasoning and would like to know which is the best way.

My application runs through a long list of files (about 100-200) and does some calculations on internal data. It takes a few minutes for each file.

I originally planned to create tasks based on the number of cores in the processor.

So, if there are 4 cores, I would create 3 tasks, and each of them would process a third of the files.

From what I've read, the thread pool manages all the tasks and decides how many threads to create for them based on many factors (in simple terms?).

Would it be better for me to just create a task for each file and let the thread pool decide which is best?

Any information or suggestions would be very welcome! Thanks.

EDIT: All files are around 5MB and calculating / analyzing the data in the files is very heavy.

2 answers


200 files isn't such a long list, but I still recommend not flooding the ThreadPool with pending tasks.

For this you can use an ActionBlock from TPL Dataflow. You create a block, give it an action to run for each item, and limit the parallelism to whatever you want.

Example in C#:



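// The block runs this delegate once per posted file name,
// with at most MaxDegreeOfParallelism invocations in flight at a time.
var block = new ActionBlock<string>(async fileName =>
{
    var data = await ReadFileAsync(fileName); // IO-bound part
    ProcessData(data);                        // CPU-bound part
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });

// Queue every file; the block buffers them and works through the queue.
foreach (var fileName in fileNames)
{
    block.Post(fileName);
}

// Signal that no more items are coming, then wait for all queued work to finish.
block.Complete();
await block.Completion;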

      

Since this is not a purely CPU-bound operation, you should use a higher degree of parallelism than the number of available cores. Consider putting the value in a config file so that you can tune it based on actual performance.
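As a rough illustration of the config-file idea, the sketch below reads the degree of parallelism from a plain-text file and falls back to a default when the file is missing or invalid. The file name `parallelism.txt`, the helper `ReadMaxParallelism`, and the fallback of twice the core count are all assumptions made for this example, not part of the answer above.

```csharp
using System;
using System.IO;

public static class ParallelismConfig
{
    // Hypothetical helper: reads a single integer from a text file.
    // Returns the fallback if the file is missing or does not contain a positive integer.
    public static int ReadMaxParallelism(string path, int fallback)
    {
        if (!File.Exists(path))
        {
            return fallback;
        }

        string text = File.ReadAllText(path).Trim();
        return int.TryParse(text, out int value) && value >= 1 ? value : fallback;
    }

    public static void Main()
    {
        // Assumed default for mixed IO/CPU work; tune it from measurements.
        int fallback = Environment.ProcessorCount * 2;
        int maxParallelism = ReadMaxParallelism("parallelism.txt", fallback);
        Console.WriteLine($"Using MaxDegreeOfParallelism = {maxParallelism}");
    }
}
```

The resulting value could then be passed to `MaxDegreeOfParallelism`, so the setting can be changed without recompiling.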


"based on many factors"

This is the key point. It is unpredictable (to me) how many threads will actually be started for work that is not CPU-bound under full load. The .NET thread pool heuristics are highly volatile (subjective: insane) and cannot be relied upon.

"let the thread pool decide which is best"

It does not know. The thread pool (mostly) does well at scheduling CPU-bound work, but it cannot find the optimal degree of parallelism for IO-bound work.



Use PLINQ:

myFiles
    .AsParallel()
    .WithDegreeOfParallelism(optimalDopHere)
    .ForAll(x => Process(x));

      

Determine the optimal degree of parallelism empirically.
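One possible way to do that measurement is to time the same sample of work at several degrees of parallelism and compare. The sketch below is only an illustration: `Process` here is a hypothetical stand-in that sleeps and spins, the file names are placeholders, and the candidate values are arbitrary. In practice you would run it against a representative subset of the real files.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

public static class DopBenchmark
{
    // Hypothetical stand-in for the real per-file work; replace with your own method.
    public static void Process(string fileName)
    {
        Thread.Sleep(20);          // simulates waiting on IO
        Thread.SpinWait(200_000);  // simulates a little CPU work
    }

    // Runs Process over the sample with the given degree of parallelism
    // and returns how long the whole run took.
    public static TimeSpan Measure(string[] sample, int dop)
    {
        var stopwatch = Stopwatch.StartNew();
        sample.AsParallel()
              .WithDegreeOfParallelism(dop)
              .ForAll(Process);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Main()
    {
        // Placeholder names; use a representative subset of the real files instead.
        string[] sample = Enumerable.Range(0, 32).Select(i => $"file{i}.dat").ToArray();

        foreach (int dop in new[] { 1, 2, 4, 8, 16 })
        {
            Console.WriteLine($"DOP {dop}: {Measure(sample, dop).TotalMilliseconds:F0} ms");
        }
    }
}
```

Timings from a single run are noisy, so it is worth repeating each measurement a few times before settling on a value.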

If this were purely CPU-bound, you could get away with almost any parallel construct, for example the Parallel class or PLINQ.
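For the purely CPU-bound case, a minimal sketch using `Parallel.ForEach` might look like the following. `Analyze` is a hypothetical placeholder for the real calculation, and capping the parallelism at `Environment.ProcessorCount` is an assumption for this example rather than a recommendation from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class CpuBoundExample
{
    // Hypothetical CPU-heavy analysis; a placeholder for the real calculation.
    public static long Analyze(string fileName)
    {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++)
        {
            sum += (i ^ fileName.Length) % 7;
        }
        return sum;
    }

    // Processes every file in parallel and returns how many were handled.
    public static int ProcessAll(IEnumerable<string> fileNames)
    {
        int processed = 0;
        var options = new ParallelOptions
        {
            // Assumed cap: one worker per logical core for CPU-bound work.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.ForEach(fileNames, options, fileName =>
        {
            Analyze(fileName);
            Interlocked.Increment(ref processed); // thread-safe counter update
        });

        return processed;
    }

    public static void Main()
    {
        var files = new[] { "a.dat", "b.dat", "c.dat", "d.dat" }; // placeholder names
        Console.WriteLine($"Processed {ProcessAll(files)} files");
    }
}
```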
