Efficient memory prediction with RandomForest in R
TL;DR: I want to know memory-efficient ways to do batch prediction with randomForest models built on large datasets (hundreds of features, 10k+ rows).
More details
I am working with a large dataset (over 3 GB in memory) and want to do a simple binary classification using randomForest. Since my data is proprietary I cannot share it, but assume the following code works:
library(randomForest)
library(data.table)
myData <- fread("largeDataset.tsv")
myFeatures <- myData[, !c("response"), with = FALSE]
myResponse <- myData[["response"]]
toBePredicted <- fread("unlabeledData.tsv")
rfObj <- randomForest(x = myFeatures, y = myResponse, ntree = 100L)
predictedLabels <- predict(rfObj, toBePredicted)
However, it takes up several GB of memory.
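One thing that helps independently of how the forest is fit is to predict in chunks, so the full vote matrix for toBePredicted is never materialized at once. A minimal sketch (the helper name predictInChunks and the default chunkSize are made up, not from the question; it assumes the forest was kept, i.e. the default keep.forest = TRUE):

```r
predictInChunks <- function(rf, newdata, chunkSize = 10000L) {
  n <- nrow(newdata)
  starts <- seq.int(1L, n, by = chunkSize)
  preds <- vector("list", length(starts))
  for (i in seq_along(starts)) {
    rows <- starts[i]:min(starts[i] + chunkSize - 1L, n)
    # predict() only materializes the vote matrix for this slice of rows
    preds[[i]] <- as.character(predict(rf, newdata[rows, ]))
    gc()  # release the temporary per-chunk objects
  }
  factor(unlist(preds))
}

predictedLabels <- predictInChunks(rfObj, toBePredicted)
```

Peak memory then scales with chunkSize rather than with nrow(toBePredicted).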
I know that I can save memory by disabling the proximity and importance measures and the keep.* arguments:
rfObjWithPreds <- randomForest(x = myFeatures,
y = myResponse,
proximity = FALSE,
localImp = FALSE,
importance = FALSE,
ntree = 100L,
keep.forest = FALSE,
keep.inbag = FALSE,
xtest = toBePredicted)
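With xtest supplied and keep.forest = FALSE, the test-set predictions are computed during fitting and can be read straight off the returned object (these are documented fields of a randomForest fit, so no separate predict() call, and no stored forest, is needed):

```r
predictedLabels <- rfObjWithPreds[["test"]][["predicted"]]
voteFractions  <- rfObjWithPreds[["test"]][["votes"]]  # per-class vote shares
```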
However, I am now wondering whether this is the most memory-efficient way of getting predictions for toBePredicted. Another option is to grow the forest in parallel and garbage-collect aggressively:
library(doParallel)
registerDoParallel(cores = 5)  # the argument is `cores`, not `ncores`

subForestVotes <- foreach(subForestNumber = iter(seq.int(5)),
                          .combine = cbind,
                          .packages = "randomForest") %dopar% {
    # Note: 5 workers x ntree trees each; use ntree = 20L per worker
    # if the goal is to match the single 100-tree forest above.
    rfObjWithPreds <- randomForest(x = myFeatures,
                                   y = myResponse,
                                   proximity = FALSE,
                                   localImp = FALSE,
                                   importance = FALSE,
                                   ntree = 100L,
                                   keep.forest = FALSE,
                                   keep.inbag = FALSE,
                                   xtest = toBePredicted)
    output <- rfObjWithPreds[["test"]][["votes"]]
    rm(rfObjWithPreds)
    gc()  # return memory before the next sub-forest is grown
    output
}
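Since each worker returns an n x k matrix of vote fractions (one column per class level), .combine = cbind leaves repeated class columns in subForestVotes. One way to reduce them to labels is to average per class and take the winner (a sketch; classLevels and avgVotes are just local names):

```r
classLevels <- unique(colnames(subForestVotes))
avgVotes <- sapply(classLevels, function(cl) {
  # mean vote fraction for class `cl` across the sub-forests
  rowMeans(subForestVotes[, colnames(subForestVotes) == cl, drop = FALSE])
})
predictedLabels <- factor(classLevels[max.col(avgVotes)], levels = classLevels)
```

Averaging vote fractions weights every sub-forest equally, which matches growing one big forest only when each worker grows the same number of trees.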
Does anyone have a smarter way to predict toBePredicted efficiently?