H2O server crash
I have been working with H2O for the last year and I am very tired of server crashes. I ditched "nightly releases" as they break easily across my datasets. Please tell me where I can download the stable version.
Charles
My environment:
- Windows 10 Enterprise, Build 1607, 64GB.
- Java SE Development Kit 8 Update 77 (64-bit).
- Anaconda Python 3.6.2-0.
I started the server with
import h2o

localH2O = h2o.init(ip="localhost",
                    port=54321,
                    max_mem_size="12G",
                    nthreads=4)
Cluster information reported by H2O:
H2O cluster uptime: 12 hours 12 mins
H2O cluster version: 3.10.5.2
H2O cluster version age: 1 month and 6 days
H2O cluster name: H2O_from_python_Charles_ji1ndk
H2O cluster total nodes: 1
H2O cluster free memory: 6.994 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 4
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy:
H2O internal security: False
Python version: 3.6.2 final
Crash information:
OSError: Job with key $03017f00000132d4ffffffff$_a0ce9b2c855ea5cff1aa58d65c2a4e7c failed with an exception: java.lang.AssertionError: I am really confused about the heap usage; MEM_MAX=11453595648 heapUsedGC=11482667352
stacktrace:
java.lang.AssertionError: I am really confused about the heap usage; MEM_MAX=11453595648 heapUsedGC=11482667352
at water.MemoryManager.set_goals(MemoryManager.java:97)
at water.MemoryManager.malloc(MemoryManager.java:265)
at water.MemoryManager.malloc(MemoryManager.java:222)
at water.MemoryManager.arrayCopyOfRange(MemoryManager.java:291)
at water.AutoBuffer.expandByteBuffer(AutoBuffer.java:719)
at water.AutoBuffer.putA4f(AutoBuffer.java:1355)
at hex.deeplearning.Storage$DenseRowMatrix$Icer.write129(Storage$DenseRowMatrix$Icer.java)
at hex.deeplearning.Storage$DenseRowMatrix$Icer.write(Storage$DenseRowMatrix$Icer.java)
at water.Iced.write(Iced.java:61)
at water.AutoBuffer.put(AutoBuffer.java:771)
at water.AutoBuffer.putA(AutoBuffer.java:883)
at hex.deeplearning.DeepLearningModelInfo$Icer.write128(DeepLearningModelInfo$Icer.java)
at hex.deeplearning.DeepLearningModelInfo$Icer.write(DeepLearningModelInfo$Icer.java)
at water.Iced.write(Iced.java:61)
at water.AutoBuffer.put(AutoBuffer.java:771)
at hex.deeplearning.DeepLearningModel$Icer.write105(DeepLearningModel$Icer.java)
at hex.deeplearning.DeepLearningModel$Icer.write(DeepLearningModel$Icer.java)
at water.Iced.write(Iced.java:61)
at water.Iced.asBytes(Iced.java:42)
at water.Value.<init>(Value.java:348)
at water.TAtomic.atomic(TAtomic.java:22)
at water.Atomic.compute2(Atomic.java:56)
at water.Atomic.fork(Atomic.java:39)
at water.Atomic.invoke(Atomic.java:31)
at water.Lockable.unlock(Lockable.java:181)
at water.Lockable.unlock(Lockable.java:176)
at hex.deeplearning.DeepLearning$DeepLearningDriver.trainModel(DeepLearning.java:491)
at hex.deeplearning.DeepLearning$DeepLearningDriver.buildModel(DeepLearning.java:311)
at hex.deeplearning.DeepLearning$DeepLearningDriver.computeImpl(DeepLearning.java:216)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
at hex.deeplearning.DeepLearning$DeepLearningDriver.compute2(DeepLearning.java:209)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
You need a bigger boat.
The error message says "heapUsedGC=11482667352", which is higher than MEM_MAX. Instead of giving max_mem_size="12G", why not give it more of the 64GB you have? Or build a less ambitious model (fewer hidden nodes, less training data, something like that).
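To make the numbers in that assertion concrete, here is a small sketch that parses the message from the stack trace and reports how far the heap overshot the limit. The helper name parse_heap_message is made up for illustration; only the two numbers come from the crash output above.

```python
import re

# The assertion text reported in the crash above.
MESSAGE = ("I am really confused about the heap usage; "
           "MEM_MAX=11453595648 heapUsedGC=11482667352")

GIB = 1024 ** 3
MIB = 1024 ** 2


def parse_heap_message(message):
    """Return (mem_max, heap_used) in bytes from the assertion text.

    Hypothetical helper, not part of the h2o package.
    """
    match = re.search(r"MEM_MAX=(\d+)\s+heapUsedGC=(\d+)", message)
    if match is None:
        raise ValueError("message does not contain MEM_MAX/heapUsedGC")
    return int(match.group(1)), int(match.group(2))


mem_max, heap_used = parse_heap_message(MESSAGE)
overshoot = heap_used - mem_max  # positive means the heap exceeded the limit

print(f"MEM_MAX    = {mem_max / GIB:.2f} GiB")
print(f"heapUsedGC = {heap_used / GIB:.2f} GiB")
print(f"overshoot  = {overshoot / MIB:.1f} MiB")
```

The overshoot is small relative to the limit, which fits the reading that the model simply ran out of the memory it was allowed rather than anything more exotic.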
(Obviously, ideally h2o shouldn't crash and should fail gracefully when it gets close to using up all available memory. If you can share your data / code with H2O it might be worth opening a bug report on their JIRA.)
BTW, I ran h2o 3.10.xx as the main web server process for 9 months or so, restarting it automatically at the weekend, and had no crashes. Well, I did have one: after I left it running for 3 weeks, it filled up the memory with more and more data and models. That's why I switched to restarting it weekly and keeping only the models I needed in memory. (This is on an AWS instance with 4GB of memory, by the way; the restart is done via a cron job and bash commands.)
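As a rough sketch of the "only keep the models I needed" idea, something like the following could be run between restarts. The function names are made up for illustration, and it assumes that h2o.ls() returns a table with a "key" column and that h2o.remove() accepts a key string; check both against your h2o version before relying on it.

```python
def keys_to_remove(all_keys, keep):
    """Return the keys in all_keys that are not in keep, preserving order."""
    keep = set(keep)
    return [key for key in all_keys if key not in keep]


def prune_cluster(keep):
    """Remove every object from the connected H2O cluster except `keep`.

    Hypothetical helper. Assumes h2o.init() or h2o.connect() has already
    been called in this session.
    """
    import h2o  # imported here so keys_to_remove works without H2O installed

    # Assumption: h2o.ls() lists the cluster's keys in a "key" column.
    all_keys = list(h2o.ls()["key"])
    stale = keys_to_remove(all_keys, keep)
    for key in stale:
        h2o.remove(key)
    return stale


# Example with plain strings (hypothetical key names, no cluster needed):
print(keys_to_remove(["model_a", "model_b", "train.hex"], ["model_b"]))
```

Be careful with the keep list: a blunt filter like this also deletes frames and other objects that the surviving models may still need.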
You can always download the latest stable release from https://www.h2o.ai/download (there is a link that says "latest stable release"). The latest stable Python package is available from PyPI and Anaconda; the latest stable R package is available on CRAN.
I agree with Darren that you will probably need more memory. If your H2O cluster is running low on memory, H2O shouldn't crash. We usually say that you should have a cluster with at least 3-4 times as much memory as the size of your training set on disk in order to train a model. However, if you are building a grid of models, or many models, you need to increase the amount of memory so that you have enough RAM to store all of those models as well.
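The 3-4x rule of thumb is easy to turn into a quick estimate. This is only a sketch: the function names are made up, the 2 GiB file size is a hypothetical example, and the result is a starting point that does not account for the extra RAM needed to hold many models.

```python
import os

GIB = 1024 ** 3


def recommended_cluster_memory(training_bytes, factor=4):
    """Rule of thumb: cluster memory is `factor` times the on-disk training size."""
    if training_bytes < 0:
        raise ValueError("training_bytes must be non-negative")
    return training_bytes * factor


def recommended_for_file(path, factor=4):
    """Same estimate, reading the training file's size from disk."""
    return recommended_cluster_memory(os.path.getsize(path), factor)


# Hypothetical example: a training file that takes 2 GiB on disk.
low = recommended_cluster_memory(2 * GIB, factor=3)
high = recommended_cluster_memory(2 * GIB, factor=4)
print(f"suggested cluster memory: {low / GIB:.0f}-{high / GIB:.0f} GiB")
```

The resulting figure is what you would pass as max_mem_size when starting the cluster, rounded up and with extra headroom if you plan to keep many models in memory.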