Spark SQL throws java.lang.OutOfMemoryError in yarn-cluster mode but runs fine in yarn-client mode

I have a simple Hive query that works fine in yarn-client mode using the pyspark shell, but it throws the error below when I run it in yarn-cluster mode.

Exception in thread "Thread-6" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-6"
Exception in thread "Reporter" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Reporter" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"

      

Cluster info: Hadoop 2.4, Spark 1.4.0-hadoop2.4, Hive 0.13.1. The script reads 10 columns from the Hive table, does some transformations, and writes them to a file.

> --num-executors 200 --executor-memory 8G --driver-memory 16G --executor-cores 3
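(For reference, these settings correspond to a spark-submit invocation roughly like the one below; the script name is just a placeholder, not the real file.)

spark-submit \
  --master yarn-cluster \
  --num-executors 200 \
  --executor-memory 8G \
  --driver-memory 16G \
  --executor-cores 3 \
  my_script.py    # placeholder for the actual PySpark script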

      

Full stack trace:

py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o62.javaToPython.
: java.lang.OutOfMemoryError: PermGen space
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2570)
    at java.lang.Class.getDeclaredMethods(Class.java:1855)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:206)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:683)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:682)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:682)
    at org.apache.spark.api.python.SerDeUtil$.javaToPython(SerDeUtil.scala:140)
    at org.apache.spark.sql.DataFrame.javaToPython(DataFrame.scala:1435)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

      





2 answers


java.lang.OutOfMemoryError: PermGen space at java.lang.ClassLoader.defineClass1(...

You are probably running out of PermGen ("permanent generation") space in the driver JVM. This area is used to store class metadata. When you run in cluster mode, the JVM needs to load more classes (I think this is because the Application Master runs inside the same JVM as the driver). To increase the PermGen size, add the following parameter:

--driver-java-options -XX:MaxPermSize=256M

      

See also https://plumbr.eu/outofmemoryerror/permgen-space
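The same JVM option can also be supplied as a Spark property instead of --driver-java-options, either on the command line or in spark-defaults.conf (a sketch; 256M is only a starting point and may need to be raised):

# on the spark-submit command line:
--conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256M"

# or persistently in conf/spark-defaults.conf:
spark.driver.extraJavaOptions    -XX:MaxPermSize=256M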


When using HiveContext in your Python program, I found that the following parameter is also required:



--files /usr/hdp/current/spark-client/conf/hive-site.xml

      

See also https://community.hortonworks.com/questions/27239/executing-spark-submit-with-yarn-cluster-mode-and.html


I also wanted to specify a particular Python version to use, which requires yet another option:

--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/local/bin/python2.7

      

See also https://issues.apache.org/jira/browse/SPARK-9235
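Putting the three options together, a yarn-cluster submission looks roughly like this (the script name is a placeholder and the paths are from my environment, so adjust them to yours):

spark-submit \
  --master yarn-cluster \
  --driver-java-options "-XX:MaxPermSize=256M" \
  --files /usr/hdp/current/spark-client/conf/hive-site.xml \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/local/bin/python2.7 \
  my_script.py    # placeholder for the actual PySpark script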





A small addition to Mark's answer: sometimes Spark with HiveContext complains about an OutOfMemoryError without mentioning PermGen, yet only -XX:MaxPermSize helps.



So if you hit an OOM when using Spark + HiveContext, also try -XX:MaxPermSize.
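For example, something along these lines (the sizes are only guesses to start from; the executor setting is needed only if the OOM happens on the executor side):

--conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=512M" \
--conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=256M"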









