NoSuchMethodError using Databricks Spark-Avro 3.2.0
I have a Spark master and a worker running in Docker containers with Spark 2.0.2 and Hadoop 2.7. I am trying to submit a job from PySpark in another container (same network) by running
df = spark.read.json("/data/test.json")
df.write.format("com.databricks.spark.avro").save("/data/test.avro")
But I am getting this error:
java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
It doesn't matter whether I try it interactively or with spark-submit. These are the packages Spark downloads:
com.databricks#spark-avro_2.11;3.2.0 from central in [default]
com.thoughtworks.paranamer#paranamer;2.7 from central in [default]
org.apache.avro#avro;1.8.1 from central in [default]
org.apache.commons#commons-compress;1.8.1 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
org.tukaani#xz;1.5 from central in [default]
org.xerial.snappy#snappy-java;1.1.1.3 from central in [default]
spark-submit --version
output:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.2
/_/
Branch
Compiled by user jenkins on 2016-11-08T01:39:48Z
Revision
Url
Type --help for more information.
scala version - 2.11.8
My pyspark command:
PYSPARK_PYTHON=ipython /usr/spark-2.0.2/bin/pyspark --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1
My spark-submit command:
spark-submit script.py --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1
I read here that it might be caused by an "old version of avro in use", so I tried using 1.8.1, but I still get the same error. Reading Avro works fine. Any help?
The reason for this error is that Apache Avro version 1.7.4 is included in Hadoop by default, and if the SPARK_DIST_CLASSPATH
env variable includes the Hadoop common libs ( $HADOOP_HOME/share/hadoop/common/lib/
) before the ivy2 jars, the wrong version may be used instead of the version required by spark-avro (>= 1.7.6), which is installed in ivy2.
To check whether this is the case, open spark-shell and run
sc.getClass().getResource("/org/apache/avro/generic/GenericData.class")
This should print the location of the class, for example:
java.net.URL = jar:file:/lib/ivy/jars/org.apache.avro_avro-1.7.6.jar!/org/apache/avro/generic/GenericData.class
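If you are in pyspark rather than spark-shell, a rough equivalent of the same check (a sketch that goes through sc._jvm, PySpark's Py4J handle to the driver JVM, which is an internal API) is:
# Inside pyspark; sc is the active SparkContext, sc._jvm exposes the driver JVM via Py4J
klass = sc._jvm.java.lang.Class.forName("org.apache.avro.generic.GenericData")
# An absolute resource path is resolved against the classloader, so this prints the jar the class was loaded from
print(klass.getResource("/org/apache/avro/generic/GenericData.class"))
Either way, the URL should point at the avro 1.7.6+ jar in the ivy2 cache, not at anything under $HADOOP_HOME.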
If this class points to $HADOOP_HOME/share/hadoop/common/lib/
instead, then you just have to list your ivy2 jars before the Hadoop common jars in the SPARK_DIST_CLASSPATH
env variable.
For example, in the Dockerfile
ENV SPARK_DIST_CLASSPATH="/home/root/.ivy2/*:$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"
Note: /home/root/.ivy2
is the default location for the ivy2 jars; you can control this location by setting spark.jars.ivy
in spark-defaults.conf
, which is probably a good idea.
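A minimal sketch of such an entry (the /opt/ivy2 path is just an illustration, pick whatever matches your image layout):
# $SPARK_HOME/conf/spark-defaults.conf
# --packages will then resolve and cache jars under /opt/ivy2/jars/
spark.jars.ivy /opt/ivy2
If you do that, remember to point the first SPARK_DIST_CLASSPATH entry at the matching directory (e.g. /opt/ivy2/jars/*) so the downloaded jars still come first.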