Is the thread pool running sequentially?
I am writing a function that gets all the files in a directory tree, and does so in parallel by queuing the work for each subdirectory onto the thread pool. I thought this would mean each directory gets traversed in parallel, and since there are many subdirectories, the whole thing would finish much faster than a sequential walk. My code looks like this:
private object addlock = new object();

private void addFiles(string[] newFiles)
{
    lock (addlock) {
        files.AddRange( newFiles );
        Console.WriteLine( "Added {0} files", newFiles.Length );
    }
}

private void getFilesParallel(string dir)
{
    if (!Directory.Exists( dir )) {
        return;
    }

    string[] dirs = Directory.GetDirectories( dir, "*", SearchOption.TopDirectoryOnly );
    ManualResetEvent mre = new ManualResetEvent( false );

    ThreadPool.QueueUserWorkItem( (object obj) =>
    {
        addFiles( Directory.GetFiles( dir, "*", SearchOption.TopDirectoryOnly ) );
        mre.Set();
    } );

    Process currentProcess = Process.GetCurrentProcess();
    long memorySize = currentProcess.PrivateMemorySize64;
    Console.WriteLine( "Used {0}", memorySize );

    foreach (string str in dirs) {
        getFilesParallel( str );
    }

    mre.WaitOne();
}
The problem is I am getting output like this:
Added 34510 files
Used 301420544
Added 41051 files
Used 313937920
Added 39093 files
Used 322764800
Added 44426 files
Used 342536192
Added 30772 files
Used 350728192
Added 36262 files
Used 360329216
Added 31686 files
Used 368685056
Added 33194 files
Used 374894592
Added 34486 files
Used 384057344
Added 37298 files
Used 393998336
This suggests that my code is running sequentially; if the work items were really running on different threads, I would expect the "Added" and "Used" lines to come out interleaved in clumps rather than strictly alternating. I ran it multiple times on different folders and the result is always the same. Why is this running sequentially?
You only have one physical drive. The disk head can only be in one place at a time, so asking it for two pieces of information at once cannot make it be in two places at once.
There is a small amount of CPU work in your program that could genuinely run in parallel, but it is not the bottleneck.
If you had multiple physical disk drives with data on each, you could access the data on each of them in parallel and actually get this work done concurrently.
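To convince yourself that the thread pool itself is not the thing serializing your work, replace the disk access with CPU-bound busy work; the output then arrives interleaved across threads. A minimal sketch of my own (not code from the question) that illustrates this:

using System;
using System.Threading;

class PoolDemo
{
    static void Main()
    {
        using (CountdownEvent done = new CountdownEvent(4))
        {
            for (int i = 0; i < 4; i++)
            {
                int id = i; // capture the loop variable for the closure
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    for (int step = 0; step < 3; step++)
                    {
                        Thread.SpinWait(50000000); // CPU-bound work instead of disk access
                        Console.WriteLine("worker {0} step {1} on thread {2}",
                            id, step, Thread.CurrentThread.ManagedThreadId);
                    }
                    done.Signal();
                });
            }
            done.Wait(); // the lines above print interleaved, not in per-worker clumps
        }
    }
}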
It's a little tricky to measure this accurately, because if you have enough memory the first run will pull the data into the file system cache, and subsequent enumerations of the same folder may complete without touching the disk at all.
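You can see the caching effect directly by timing the same enumeration twice in a row; the second, warm run is usually far faster because it never touches the disk. A quick illustrative sketch (not part of the benchmark below):

using System;
using System.Diagnostics;
using System.IO;

class CacheDemo
{
    static void Main(string[] args)
    {
        string path = args[0];
        for (int run = 1; run <= 2; run++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            int count = Directory.GetFiles(path, "*", SearchOption.AllDirectories).Length;
            sw.Stop();
            // Run 1 is cold and hits the disk; run 2 is typically served from the cache.
            Console.WriteLine("Run {0}: {1} files in {2} ms", run, count, sw.ElapsedMilliseconds);
        }
    }
}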
It is also worth considering that an SSD will benefit more from parallel operations, as it supports far more IOPS; it has no moving parts to wait for.
The code below shows that on my quad-core i5, the parallel version can be 2 to 3 times faster than the sequential one when working against an SSD, or when the data is already cached.
It demonstrates the use of Parallel.ForEach, which takes most of the pain out of managing the parallelism yourself.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;

namespace FilesReader
{
    class Program
    {
        static void Main(string[] args)
        {
            string path = args[0];
            RunTrial(path, false);
            RunTrial(path, true);
        }

        private static void RunTrial(string path, bool useParallel)
        {
            Console.WriteLine("Parallel: {0}", useParallel);
            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            FileListing listing = new FileListing(path, useParallel);
            stopwatch.Stop();
            Console.WriteLine("Added {0} files in {1} ms ({2} files/second)",
                listing.Files.Count, stopwatch.ElapsedMilliseconds,
                listing.Files.Count * 1000 / Math.Max(1, stopwatch.ElapsedMilliseconds));
        }
    }

    class FileListing
    {
        private ConcurrentList<string> _files;
        private bool _parallelExecution;

        public FileListing(string path, bool parallelExecution)
        {
            _parallelExecution = parallelExecution;
            _files = new ConcurrentList<string>();
            BuildListing(path);
        }

        public ConcurrentList<string> Files
        {
            get { return _files; }
        }

        private void BuildListing(string path)
        {
            string[] dirs = null;
            string[] files = null;
            bool success = false;
            try
            {
                dirs = Directory.GetDirectories(path, "*", SearchOption.TopDirectoryOnly);
                files = Directory.GetFiles(path);
                success = true;
            }
            catch (SystemException) { /* Suppress security exceptions etc. */ }

            if (success)
            {
                Files.AddRange(files);
                if (_parallelExecution)
                {
                    // Recurse into each subdirectory on the thread pool.
                    Parallel.ForEach(dirs, d => BuildListing(d));
                }
                else
                {
                    foreach (string dir in dirs)
                    {
                        BuildListing(dir);
                    }
                }
            }
        }
    }

    // Minimal thread-safe wrapper around List<T>; all access goes through one lock.
    class ConcurrentList<T>
    {
        object lockObject = new object();
        List<T> list;

        public ConcurrentList()
        {
            list = new List<T>();
        }

        public void Add(T item)
        {
            lock (lockObject) list.Add(item);
        }

        public void AddRange(IEnumerable<T> collection)
        {
            lock (lockObject) list.AddRange(collection);
        }

        public long Count
        {
            get { lock (lockObject) return list.Count; }
        }
    }
}
I considered using the built-in concurrent collections instead of rolling my own thread-safe list, but they turned out to be about 5% slower.
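For reference, here is a minimal sketch of what that swap would look like with ConcurrentBag&lt;T&gt; (my illustration, not benchmarked code from this answer). Note that ConcurrentBag&lt;T&gt; is unordered and has no AddRange, so items go in one at a time:

using System.Collections.Concurrent;

// Replacing the hand-rolled ConcurrentList<string> in FileListing:
ConcurrentBag<string> _files = new ConcurrentBag<string>();

// ...and in BuildListing, instead of Files.AddRange(files):
foreach (string file in files)
{
    _files.Add(file); // thread-safe without an explicit lock
}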