Recursively traverse a LARGE directory using Scala 2.8 continuations
Is it possible to recursively traverse a directory using Scala continuations (introduced in 2.8)?
My directory contains millions of files, so I cannot use a Stream
because I will run out of memory. I am trying to write an Actor-based
dispatcher so that worker actors process the files in parallel.
Does anyone have an example?
If you want to stick with Java 1.6 (as opposed to FileVisitor in 1.7), and your millions of files are spread across subdirectories rather than all sitting in one directory (listFiles loads a whole directory's entries at once), you can use an iterator:
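import java.io.File

// Lazily walks a directory tree; only one listing per nesting level is held in memory.
class DirectoryIterator(f: File) extends Iterator[File] {
  // listFiles returns null for non-directories or on I/O errors, so fall back to an empty array
  private[this] val fs = Option(f.listFiles).getOrElse(Array[File]())
  private[this] var i = -1
  private[this] var recurse: DirectoryIterator = null
  def hasNext: Boolean = {
    if (recurse != null && recurse.hasNext) true
    else (i+1 < fs.length)
  }
  def next(): File = {
    // Finish the most recently entered subdirectory before moving on
    if (recurse != null && recurse.hasNext) recurse.next()
    else if (i+1 >= fs.length) {
      throw new java.util.NoSuchElementException("next on empty file iterator")
    }
    else {
      i += 1
      // Return the directory entry itself now; its contents follow on later calls
      if (fs(i).isDirectory) recurse = new DirectoryIterator(fs(i))
      fs(i)
    }
  }
}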
This requires that there are no loops in your filesystem. If it has loops, you need to keep track of the directories you have visited in a set and avoid descending into them again. (If you also don't want to hit the same file twice when it is linked from two different locations, you have to put everything in a set, and then there isn't much point in using an iterator rather than just reading all the file information into memory.)
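The directory-tracking idea could be sketched as follows. This is only an illustration under some assumptions: the class name LoopSafeDirectoryIterator is made up, loops are detected by comparing canonical paths (File.getCanonicalPath resolves symbolic links on most platforms), and an explicit stack replaces the nested iterators. Only directories are remembered, so memory grows with the number of directories, not the number of files.

```scala
import java.io.File
import scala.collection.mutable

// Hypothetical loop-safe variant: depth-first walk that never lists
// the same canonical directory twice.
class LoopSafeDirectoryIterator(root: File) extends Iterator[File] {
  // Canonical paths of every directory listed so far
  private[this] val seenDirs = mutable.Set[String]()
  // One pending listing per nesting level, innermost first
  private[this] var stack: List[Iterator[File]] = List(children(root))

  // Entries of dir, or nothing if dir was already listed or is unreadable
  private[this] def children(dir: File): Iterator[File] =
    if (seenDirs.add(dir.getCanonicalPath))
      Option(dir.listFiles).getOrElse(Array.empty[File]).iterator
    else Iterator.empty

  def hasNext: Boolean = {
    // Drop exhausted listings from the top of the stack
    while (stack.nonEmpty && !stack.head.hasNext) stack = stack.tail
    stack.nonEmpty
  }

  def next(): File = {
    if (!hasNext) throw new java.util.NoSuchElementException("next on empty file iterator")
    val f = stack.head.next()
    // A directory that was already visited is still returned, just not descended into
    if (f.isDirectory) stack = children(f) :: stack
    f
  }
}
```

Usage is the same as for any Iterator[File], e.g. new LoopSafeDirectoryIterator(new File("/some/dir")).foreach(println), where the path is a placeholder.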
This is more of a question than an answer.
If your process is I/O bound, parallel processing may not improve your throughput. In many cases it will make things worse by causing the disk head to thrash. Before doing much along these lines, look at how busy the disk is. If it is already busy most of the time with one thread, at most one additional thread will be useful, and even that can be counterproductive.
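One rough way to check this before committing to a design is to time the same work with one worker and then with two. The sketch below is an assumption-laden illustration, not a benchmark harness: ThroughputCheck and timeRun are made-up names, plain java.util.concurrent threads stand in for Actors, and a semaphore caps the number of queued files so that a lazy iterator over millions of files is never fully materialized.

```scala
import java.io.File
import java.util.concurrent.{Executors, Semaphore, TimeUnit}

object ThroughputCheck {
  // Runs `work` on every file using `threads` workers; returns elapsed milliseconds.
  def timeRun(files: Iterator[File], threads: Int)(work: File => Unit): Long = {
    val pool = Executors.newFixedThreadPool(threads)
    // At most threads * 4 files are queued or in progress at any time
    val permits = new Semaphore(threads * 4)
    val start = System.nanoTime()
    files.foreach { f =>
      permits.acquire() // blocks the producer while the workers are saturated
      pool.execute(new Runnable {
        def run(): Unit = try work(f) finally permits.release()
      })
    }
    pool.shutdown()
    pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
    (System.nanoTime() - start) / 1000000L
  }
}
```

Comparing timeRun(files, 1)(work) against timeRun(files, 2)(work) on a representative subtree gives a first impression; keep in mind that the operating system's file cache can make the second run look faster than it really is.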