Parallel processing in R with H2O
I am writing parallel code that runs a computation for each of N groups in my data using foreach. Each computation involves a call to h2o.gbm.
In my current sequential setup, a single run uses up to about 70% of my RAM.
How do I set up h2o.init() correctly in the parallel version of the code? I am afraid I might run out of RAM when using multiple cores.
My Windows 10 machine has 12 cores and 128 GB of RAM.
Would something like this pseudocode work?
library(foreach)
library(doParallel)

# set up a parallel backend that uses 12 processors
cl <- makeCluster(12)
registerDoParallel(cl)

# loop over the 999 groups
df4 <- foreach(i = seq(1, 999), .combine = rbind) %dopar% {
  library(h2o)                                  # each worker needs to load h2o itself
  # bunch of computations
  h2o.init(nthreads = 1, max_mem_size = "10G")  # one core and 10 GB per worker?
  gbm <- h2o.gbm(train_some_model)              # placeholder for the real model call
  data.frame(someoutput)                        # placeholder for the real output
}

fwrite(df4, append = TRUE)                      # file argument omitted in this pseudocode
stopCluster(cl)
The way your code is configured at the moment is not the best option. I understand what you are trying to do: execute multiple GBMs in parallel (one per single-core H2O cluster) so that you can maximize CPU usage across the 12 cores on your machine. However, as written, your code will try to run all the GBMs in the foreach loop against the same single-core H2O cluster. You can only connect to one H2O cluster at a time from a single R instance; the foreach loop, however, will create a new R instance for each worker.
Unlike most machine learning algorithms in R, the H2O algorithms are all multi-core enabled, so training is already parallelized at the algorithm level without needing a parallel R package such as foreach.
You have a few options (#1 or #3 is probably best):

1. Set h2o.init(nthreads = -1) at the top of your script to use all 12 of your cores. Change the foreach() loop to a regular loop and train each GBM sequentially (each on a different data partition). Although the different GBMs are trained sequentially, each individual GBM will be fully parallelized across the H2O cluster. (A sketch of this approach follows the code example below.)

2. Set h2o.init(nthreads = -1) at the top of your script, but keep the foreach() loop. This should run all of your GBMs at the same time, with each GBM parallelized across all cores. It could overwhelm the H2O cluster a bit (this is not really how H2O is meant to be used) and could be slightly slower than #1, but it is hard to say without knowing the size of your data and the number of partitions you want to train on. If you are already using 70% of your RAM for one GBM, this might not be the best option.

3. You can update your code to do the following (which is most similar to your original script). This keeps your foreach loop, but creates a new single-core H2O cluster on a different port of your machine for each worker. See below.
Here is an updated R code example (for option #3) that uses the iris dataset and returns the predicted class for iris as a data.frame:
library(foreach)
library(doParallel)
library(h2o)

# shut down any H2O cluster already attached to this R session
h2o.shutdown(prompt = FALSE)

# set up a parallel backend that uses 12 processors
cl <- makeCluster(12)
registerDoParallel(cl)

# loop: each worker starts its own single-core H2O cluster on a unique port
df4 <- foreach(i = seq(20), .combine = rbind) %dopar% {
  library(h2o)
  port <- 54321 + 3 * i   # leave a gap of ports between clusters
  print(paste0("http://localhost:", port))
  h2o.init(nthreads = 1, max_mem_size = "1G", port = port)

  data(iris)
  data <- as.h2o(iris)
  ss <- h2o.splitFrame(data)
  gbm <- h2o.gbm(x = 1:4, y = "Species", training_frame = ss[[1]])

  # return the predicted class for the holdout split as this iteration's result
  as.data.frame(h2o.predict(gbm, ss[[2]]))[, 1]
}
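For comparison, here is a minimal sketch of option #1 (this is not code from the question; the 20 iterations and the iris data are stand-ins for your own data partitions, and the memory size is an example value):

library(h2o)

# one multi-threaded H2O cluster started once, using all available cores
h2o.init(nthreads = -1, max_mem_size = "100G")   # example memory size; tune for your machine

results <- list()
for (i in seq(20)) {                             # placeholder: loop over your data partitions
  data(iris)
  train <- as.h2o(iris)                          # in practice, the i-th partition of your data
  # each h2o.gbm() call is parallelized across all cores by H2O itself
  gbm <- h2o.gbm(x = 1:4, y = "Species", training_frame = train)
  results[[i]] <- as.data.frame(h2o.predict(gbm, train))
}
df4 <- do.call(rbind, results)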
To judge which option is best for your case, I would try each on a handful of data partitions (perhaps 10-100) and see which approach is fastest. If your training data is small, it is possible that #3 will be faster than #1, but overall I would say #1 is probably the most scalable/stable solution.
Following on from Erin LeDell's answer, I just wanted to add that in many cases a decent practical solution sits between #1 and #3. To increase CPU utilization and still save RAM, you can run several H2O instances in parallel, each using several cores, without a significant performance loss compared to running many instances with only one core each.
I ran an experiment using a relatively small 40 MB dataset (240K rows, 22 columns) on a 36-core server.
- Case 1: Use all 36 cores (nthreads = 36) in a single H2O instance to train 120 GBM models (with default hyperparameters) sequentially.
- Case 2: Use foreach to run 4 H2O instances on the machine, each using 9 cores, to train 30 GBM models with default hyperparameters sequentially (120 models in total).
- Case 3: Use foreach to run 12 H2O instances on the machine, each using 3 cores, to train 10 GBM models with default hyperparameters sequentially (120 models in total).
Using 36 cores to train a single GBM model on this dataset is very inefficient: the CPU utilization in case 1 fluctuates a lot but averages below 50%. So there is clearly something to be gained from running more than one H2O instance at a time.
- Case 1 runtime: 264 seconds
- Case 2 runtime: 132 seconds
- Case 3 runtime: 130 seconds
Given the marginal improvement from 4 to 12 H2O instances, I did not bother running 36 H2O instances, each using a single core, in parallel.
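As a rough illustration only (the thread count, memory size, ports, and the iris-based modeling code are placeholders, not the code used in the experiment above), a middle-ground setup along the lines of case 2 could look like this:

library(foreach)
library(doParallel)
library(h2o)

n_workers <- 4                # e.g. 4 H2O instances with 9 cores each on a 36-core machine

cl <- makeCluster(n_workers)
registerDoParallel(cl)

results <- foreach(w = seq(n_workers), .combine = rbind, .packages = "h2o") %dopar% {
  # each R worker starts its own multi-threaded H2O cluster on a distinct port
  h2o.init(nthreads = 9, max_mem_size = "4G", port = 54321 + 3 * w)

  data(iris)
  train <- as.h2o(iris)       # placeholder for this worker's share of the models/data
  gbm <- h2o.gbm(x = 1:4, y = "Species", training_frame = train)
  as.data.frame(h2o.predict(gbm, train))
}

# shut the worker H2O instances down when finished
foreach(w = seq(n_workers), .packages = "h2o") %dopar% h2o.shutdown(prompt = FALSE)
stopCluster(cl)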
I have tried several ways of making H2O run in parallel from R. I have 15,390 XGBoost predictions to run.
a) One H2O cluster plus an R cluster with a foreach %dopar% loop in which every worker calls h2o.init() to connect to that single H2O instance.
==> Problems with objects in H2O: either H2O tries to create a temporary object that already exists (probably created moments earlier by another worker), or it deletes a temporary object that another worker is still using.
b) An R cluster with a foreach %dopar% loop that starts and stops H2O in each iteration, on a different port tied to the R worker. (If we simply keep incrementing the port, we eventually hit the maximum port number; note that H2O uses one extra port in addition to the one passed to h2o.init(), so the port should be X + iter * 2.)
==> Problems with H2O occasionally failing to start, which kills the process.
c) Create an R cluster, then run a first foreach %dopar% loop that just calls h2o.init(...) on each worker, a second foreach %dopar% loop that runs the main code, and a third foreach %dopar% loop that just shuts H2O down.
==> This is the most stable solution I have found, but memory still creeps up slowly in every H2O instance, even with h2o.removeAll() at the end of each iteration. I wrap this in a simple outer for loop that creates and destroys the cluster from time to time (a rough sketch of that wrapper appears after the code below), but then I run back into issue b), where H2O sometimes refuses to start.
I will keep looking, but if you have any other suggestions, please share. Thanks.
Extract of the code for approach c):
cl <- makeCluster(nb_worker)
registerDoParallel(cl)
getDoParWorkers()

# first loop: start one single-core H2O instance per worker
foreach(workers = 1:getDoParWorkers(), .packages = c('h2o', 'opera')) %dopar% {
  Sys.sleep(workers)  # stagger the JVM start of each H2O by worker-seconds, to help with the H2O start issue
  h2o.init(port = 54350 + (workers * 2), min_mem_size = "2g", nthreads = 1)  # note that H2O uses two consecutive ports
  h2o.clusterStatus()
}

# second loop: the main computation
parallel_result <- foreach(Stuff = 1:NbrOfStuff, .export = c('xxx', 'xxxx', 'xxxxx'), .packages = c('h2o', 'xxxx')) %dopar% {
  ### Your code goes here; finish with iter_output as an R data frame
  h2o.removeAll()  # free H2O memory at the end of each iteration
  return(iter_output)
}  # end foreach loop

# third loop: shut down the H2O instance of each worker
foreach(workers = 1:getDoParWorkers()) %dopar% {
  h2o.shutdown(prompt = FALSE)
}

stopCluster(cl)
rm(cl)
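A rough sketch of the periodic-restart wrapper mentioned above. This is not the code actually used; run_one_item() is a hypothetical stand-in for the real per-item work, and the batch size and worker count are illustrative:

library(foreach)
library(doParallel)
library(h2o)

run_one_item <- function(i) data.frame(item = i)   # hypothetical placeholder for the real work

NbrOfStuff <- 15390
nb_worker  <- 4                                    # illustrative worker count
batch_size <- 500                                  # illustrative batch size
batches <- split(seq_len(NbrOfStuff), ceiling(seq_len(NbrOfStuff) / batch_size))

all_results <- list()
for (b in seq_along(batches)) {
  cl <- makeCluster(nb_worker)
  registerDoParallel(cl)

  # start one single-core H2O instance per worker, staggered as above
  foreach(w = 1:nb_worker, .packages = "h2o") %dopar% {
    Sys.sleep(w)
    h2o.init(port = 54350 + w * 2, min_mem_size = "2g", nthreads = 1)
  }

  # run this batch of items
  all_results[[b]] <- foreach(i = batches[[b]], .combine = rbind, .packages = "h2o") %dopar% {
    out <- run_one_item(i)
    h2o.removeAll()                                # free H2O memory after each item
    out
  }

  # shut down the H2O instances and the R workers before the next batch
  foreach(w = 1:nb_worker, .packages = "h2o") %dopar% h2o.shutdown(prompt = FALSE)
  stopCluster(cl)
}
final_result <- do.call(rbind, all_results)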