Memory-efficient prediction with randomForest in R

TL;DR: I want to know memory-efficient ways to do batch prediction with randomForest models built on large datasets (hundreds of features, 10k rows).

More details

I am working with a large dataset (over 3 GB in memory) and want to do a simple binary classification using randomForest. Since my data is proprietary I cannot share it, but let's say the following code works:

library(randomForest)
library(data.table)

myData <- fread("largeDataset.tsv")
myFeatures <- myData[, !c("response"), with = FALSE]
myResponse <- myData[["response"]]

toBePredicted <- fread("unlabeledData.tsv")

rfObj <- randomForest(x = myFeatures, y = myResponse, ntree = 100L)

predictedLabels <- predict(rfObj, toBePredicted)
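
For reference, one way to bound the working memory of the predict() step itself is to score the new data in row chunks instead of all at once. This is only a sketch on synthetic data: trainX, trainY, newX and chunkSize are made-up stand-ins for myFeatures, myResponse, toBePredicted and a tuning value, and whether chunking helps depends on where the memory actually goes.

```r
library(randomForest)

set.seed(7)
# Synthetic stand-ins for the real data (assumptions, not the original dataset)
n <- 300; p <- 5
trainX <- as.data.frame(matrix(rnorm(n * p), n, p))
trainY <- factor(ifelse(trainX$V1 + rnorm(n, sd = 0.5) > 0, "yes", "no"))
newX   <- as.data.frame(matrix(rnorm(250 * p), 250, p))

# An odd number of trees avoids tied votes in a two-class problem
rf <- randomForest(x = trainX, y = trainY, ntree = 51L)

chunkSize <- 100L  # hypothetical value; tune to the available memory
chunkId   <- ceiling(seq_len(nrow(newX)) / chunkSize)

# Predict one block of rows at a time, then stitch the factors back together
predictedChunks <- lapply(split(seq_len(nrow(newX)), chunkId), function(rows) {
  predict(rf, newX[rows, , drop = FALSE])
})
predictedLabels <- unlist(predictedChunks, use.names = FALSE)
```

The chunked result should match a single predict(rf, newX) call; only the peak size of the intermediate objects changes.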

      

However, it takes up several GB of memory.

I know that I can save memory by disabling the proximity and importance measures and the keep.* arguments:

rfObjWithPreds <- randomForest(x = myFeatures,
                               y = myResponse,
                               proximity = FALSE,
                               localImp = FALSE,
                               importance = FALSE,
                               ntree = 100L,
                               keep.forest = FALSE,
                               keep.inbag = FALSE,
                               xtest = toBePredicted)
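
To make explicit where the predictions end up in this variant, here is a small sketch on synthetic data (trainX, trainY and newX are made-up stand-ins). When xtest is supplied, the fitted object carries a test component holding the predicted labels and the per-class votes, so no separate predict() call is needed.

```r
library(randomForest)

set.seed(1)
# Synthetic stand-ins for myFeatures / myResponse / toBePredicted (assumptions)
n <- 200; p <- 5
trainX <- as.data.frame(matrix(rnorm(n * p), n, p))
trainY <- factor(ifelse(trainX$V1 + rnorm(n, sd = 0.5) > 0, "yes", "no"))
newX   <- as.data.frame(matrix(rnorm(50 * p), 50, p))

rf <- randomForest(x = trainX, y = trainY, ntree = 100L,
                   keep.forest = FALSE, xtest = newX)

predictedLabels <- rf$test$predicted  # factor with one label per row of newX
voteFractions   <- rf$test$votes      # one row per new observation, one column per class
forestDropped   <- is.null(rf$forest) # the trees themselves are not retained
```

The trade-off is that the forest cannot be reused afterwards: any additional data has to be passed through xtest in a new fit.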

      

However, I am now wondering whether this is the most memory-efficient way of getting predictions for toBePredicted. Another way I could do this is to grow the forest in parallel and actively garbage collect:

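library(doParallel)
registerDoParallel(cores = 5)  # the argument is `cores`, not `ncores`

# Grow 5 sub-forests in parallel; each returns only its test-set votes
subForestVotes <- foreach(subForestNumber = iter(seq.int(5)),
                          .combine = cbind) %dopar% {
    rfObjWithPreds <- randomForest(x = myFeatures,
                                   y = myResponse,
                                   proximity = FALSE,
                                   localImp = FALSE,
                                   importance = FALSE,
                                   ntree = 100L,
                                   keep.forest = FALSE,
                                   keep.inbag = FALSE,
                                   xtest = toBePredicted)
    output <- rfObjWithPreds[["test"]][["votes"]]
    rm(rfObjWithPreds)
    gc()    # release the fitted object's memory before returning
    output  # value of the last expression is what foreach collects
}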
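
The cbind above leaves one block of vote columns per sub-forest, so the votes still have to be reduced to a single label per row. A possible way to do that, sketched here on synthetic data with lapply standing in for the foreach loop, is to sum the vote matrices and take the class with the most votes. trainX, trainY, newX, nSubForests and treesPerSubForest are illustrative names, and the sketch assumes every sub-forest has the same number of trees so that summing works for raw counts and normalized fractions alike.

```r
library(randomForest)

set.seed(42)
# Synthetic stand-ins for the real data (assumptions)
n <- 200; p <- 5
trainX <- as.data.frame(matrix(rnorm(n * p), n, p))
trainY <- factor(ifelse(trainX$V1 + rnorm(n, sd = 0.5) > 0, "yes", "no"))
newX   <- as.data.frame(matrix(rnorm(50 * p), 50, p))

nSubForests       <- 5L
treesPerSubForest <- 20L  # 5 x 20 = 100 trees overall

# Each sub-forest returns only its vote matrix for the new data
voteList <- lapply(seq_len(nSubForests), function(i) {
  rf <- randomForest(x = trainX, y = trainY, ntree = treesPerSubForest,
                     keep.forest = FALSE, norm.votes = FALSE, xtest = newX)
  unclass(rf$test$votes)  # plain numeric matrix, rows = observations
})

# Element-wise sum across sub-forests, then pick the winning class per row
totalVotes <- Reduce(`+`, voteList)
winner <- max.col(totalVotes, ties.method = "first")
predictedLabels <- factor(colnames(totalVotes)[winner], levels = levels(trainY))
```

In the foreach version, the same reduction could be obtained by using `+` as the .combine function instead of cbind.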

      

Does anyone have a smarter way to efficiently get predictions for toBePredicted?
