Java.util.HashMap is missing from PySpark session
I am working with Apache Spark 1.4.0 on Windows 7 x64 with Java 1.8.0_45 x64 and Python 2.7.10 x86 in IPython 3.2.0
I'm trying to write a DataFrame program in an IPython notebook that reads and writes back to a SQL Server database.
So far, I can read data from the database:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc",url="jdbc:sqlserver://serverURL", dbtable="dbName.tableName", driver="com.microsoft.sqlserver.jdbc.SQLServerDriver", user="userName", password="password")
and convert the data to pandas and do whatever I want. (It was more than a little hassle, but it works after adding Microsoft's sqljdbc42.jar to spark.driver.extraClassPath in spark-defaults.conf.)
The current issue occurs when I post data back to SQL Server with the DataFrameWriter API:
df.write.jdbc("jdbc:sqlserver://serverURL", "dbName.SparkTestTable1", dict(driver="com.microsoft.sqlserver.jdbc.SQLServerDriver", user="userName", password="password"))
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
<ipython-input-19-8502a3e85b1e> in <module>()
----> 1 df.write.jdbc("jdbc:sqlserver://jdbc:sqlserver", "dbName.SparkTestTable1", dict(driver="com.microsoft.sqlserver.jdbc.SQLServerDriver", user="userName", password="password"))
C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\readwriter.pyc in jdbc(self, url, table, mode, properties)
394 for k in properties:
395 jprop.setProperty(k, properties[k])
--> 396 self._jwrite.mode(mode).jdbc(url, table, jprop)
397
398
C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:
C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
302 raise Py4JError(
303 'An error occurred while calling {0}{1}{2}. Trace:\n{3}\n'.
--> 304 format(target_id, '.', name, value))
305 else:
306 raise Py4JError(
Py4JError: An error occurred while calling o49.mode. Trace:
py4j.Py4JException: Method mode([class java.util.HashMap]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
The problem seems to be that py4j cannot find a matching JVM method when it converts the connection-properties dictionary to a java.util.HashMap. Adding my rt.jar (from the JRE path) to spark.driver.extraClassPath doesn't solve the problem. Removing the dictionary from the write call avoids this error, but of course the write then fails for lack of a driver and credentials.
Edit: the o49 part of the error message changes from run to run.
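The mismatch can be reproduced in plain Python with a simplified stand-in for the `jdbc` signature shown in the readwriter.py traceback above (the function below is a mock for illustration, not Spark's actual code):

```python
# Mock of DataFrameWriter.jdbc's positional order in Spark 1.4:
# jdbc(url, table, mode, properties). Simplified stand-in, not Spark code.
def jdbc(url, table, mode="error", properties=None):
    if properties is None:
        properties = {}
    return mode, properties

# Passing the properties dict as the THIRD positional argument binds it
# to `mode`, leaving `properties` empty. Spark then calls .mode(<HashMap>)
# on the JVM side, which is exactly the Py4JException in the traceback.
mode, props = jdbc(
    "jdbc:sqlserver://serverURL",
    "dbName.SparkTestTable1",
    dict(driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"),
)
assert isinstance(mode, dict)  # the dict landed in `mode`
assert props == {}             # `properties` never received the credentials
```

This is why the error names `mode` rather than `jdbc`: the dictionary never reaches the properties parameter at all.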
Davies Liu on the Spark users mailing list found the problem. There is a subtle difference between the Scala and Python APIs that I missed: the Python API takes a mode string (e.g. "overwrite") as its third parameter, which the Scala API does not. Changing the call as follows fixes the issue:
df.write.jdbc("jdbc:sqlserver://serverURL", "dbName.SparkTestTable1", "overwrite", dict(driver="com.microsoft.sqlserver.jdbc.SQLServerDriver", user="userName", password="password"))