Parallel file system scan

I want to get file information (file name and size in bytes) for all files in a directory tree. But there are many subdirectories (~1,000) and files (~40,000).

Currently my solution is to use filepath.Walk() to get the file information for each file, but this is quite slow.

package main

import (
    "flag"
    "fmt"
    "os"
    "path/filepath"
)

func visit(path string, f os.FileInfo, err error) error {
    if f.Mode().IsRegular() {
        fmt.Printf("Visited: %s File name: %s Size: %d bytes\n", path, f.Name(), f.Size())
    }
    return nil
}

func main() {
    flag.Parse()
    root := "C:/Users/HERNOUX-06523/go/src/boilerpipe" //flag.Arg(0)
    filepath.Walk(root, visit)
}

Can I do parallel/concurrent processing using filepath.Walk()?



1 answer


You can do parallel processing by changing your visit() function so that it does not descend into subfolders, but instead starts a new goroutine for each subfolder.

To do this, return the special error filepath.SkipDir from your visit() function if the entry is a directory. Don't forget to check whether the path inside visit() is the subfolder the goroutine is supposed to handle, because that folder is also passed to visit(), and without this check you would launch goroutines endlessly for the starting folder.

You will also need some kind of "counter" of how many goroutines are still working in the background; for that you can use sync.WaitGroup.

Here's a simple implementation of this:

package main

import (
    "flag"
    "fmt"
    "os"
    "path/filepath"
    "sync"
)

var wg sync.WaitGroup

func walkDir(dir string) {
    defer wg.Done()

    visit := func(path string, f os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        // Start a new goroutine for each subfolder, but not for dir
        // itself, which Walk also passes to visit.
        if f.IsDir() && path != dir {
            wg.Add(1)
            go walkDir(path)
            return filepath.SkipDir
        }
        if f.Mode().IsRegular() {
            fmt.Printf("Visited: %s File name: %s Size: %d bytes\n",
                path, f.Name(), f.Size())
        }
        return nil
    }

    filepath.Walk(dir, visit)
}

func main() {
    flag.Parse()
    root := "folder/to/walk" //flag.Arg(0)

    wg.Add(1)
    walkDir(root)
    wg.Wait()
}

Some notes:



Depending on the distribution of files among subfolders, this may not fully utilize your CPU / storage: if, for example, 99% of all files are in a single subfolder, one goroutine will still take most of the running time.

Also note that fmt.Printf() calls are serialized, which also slows down the process. I assume this was just an example and that in reality you will do some kind of in-memory processing / statistics. Don't forget to also protect concurrent access to any variables read or written from your visit() function.
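For example, here is a minimal, runnable sketch of mutex-protected in-memory statistics; the record() helper and the counter names are hypothetical, not part of this answer:

package main

import (
    "fmt"
    "sync"
)

// Hypothetical shared statistics; the names are illustrative.
var (
    mu        sync.Mutex
    totalSize int64
    fileCount int64
)

// record is what visit() could call instead of fmt.Printf: it updates
// the shared counters under a mutex so concurrent goroutines do not race.
func record(size int64) {
    mu.Lock()
    totalSize += size
    fileCount++
    mu.Unlock()
}

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(n int64) {
            defer wg.Done()
            record(n)
        }(int64(i))
    }
    wg.Wait()
    fmt.Printf("%d files, %d bytes total\n", fileCount, totalSize)
}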

Don't worry about the large number of subfolders. That is normal, and the Go runtime is capable of handling even hundreds of thousands of goroutines.

Also note that your storage / hard disk speed will most likely be the performance bottleneck, so you may not get the speedup you hope for. After a certain point (your disk's limit), you will not be able to improve performance further.

Also, launching a new goroutine for each subfolder may not be optimal; you may get better performance by limiting the number of goroutines traversing your folders. For that, check out and use a worker pool, as sketched below:

Is this an idiomatic worker thread pool in Go?
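As an illustration, here is a minimal sketch of such a bounded pool, assuming os.ReadDir (Go 1.16+) in place of filepath.Walk; the pool size, channel setup, and scanDir() helper are illustrative choices, not taken from the linked answer:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "sync"
)

const numWorkers = 8 // illustrative pool size

var (
    wg   sync.WaitGroup
    jobs = make(chan string)
)

// worker scans directories from the jobs channel until it is closed.
func worker() {
    for dir := range jobs {
        scanDir(dir)
        wg.Done()
    }
}

// scanDir lists a single directory: regular files are printed and
// subdirectories are queued as new jobs.
func scanDir(dir string) {
    entries, err := os.ReadDir(dir)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        return
    }
    for _, e := range entries {
        path := filepath.Join(dir, e.Name())
        if e.IsDir() {
            wg.Add(1)
            // Send from a new goroutine so the unbuffered channel
            // cannot deadlock the workers.
            go func(p string) { jobs <- p }(path)
            continue
        }
        if info, err := e.Info(); err == nil && info.Mode().IsRegular() {
            fmt.Printf("Visited: %s Size: %d bytes\n", path, info.Size())
        }
    }
}

func main() {
    for i := 0; i < numWorkers; i++ {
        go worker()
    }
    wg.Add(1)
    jobs <- "folder/to/walk"
    wg.Wait()
    close(jobs)
}

Here each directory is one job, so the pool size caps how many directories are scanned concurrently regardless of how deep the tree is.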
