How do I connect to Amazon Redshift or another database in Apache Spark?

I am trying to connect to Amazon Redshift via Spark, so I can merge data on S3 with our RS cluster data. I found some very spartan documentation here about JDBC connectivity:

https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases

The load command looks pretty straightforward (although I don't know how I would enter my AWS credentials here, perhaps in the settings?).

df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")


And I'm not really sure how to handle the SPARK_CLASSPATH variable. I am currently running Spark locally through an IPython notebook (as part of the Spark distribution). Where do I set it so that Spark picks it up?

Anyway, while trying to run these commands, I keep getting a bunch of cryptic errors, so I'm stuck for now. Any help or pointers to detailed guides would be appreciated.

+3


source


4 answers


It turns out that you only need a username/password to access Redshift from Spark, and it is done like this (using the Python API):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# the PostgreSQL JDBC driver still needs to be on the Spark classpath (see the other answers)
df = sqlContext.load(source="jdbc",
                     url="jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret",
                     dbtable="schema.table")
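
Once it loads, df is an ordinary DataFrame; a quick sanity check could look like this (assuming the placeholder connection details above are filled in):

# quick sanity check on the freshly loaded DataFrame
df.printSchema()
print(df.count())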




Hope this helps someone!

+4


source


If you are using Spark 1.4.0 or newer, take a look at spark-redshift, a library that supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you are querying large amounts of data, this approach should perform better than plain JDBC because it can unload and query the data in parallel.
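
For reference, reading from Redshift with spark-redshift looks roughly like this (a sketch with hypothetical host, table, and S3 tempdir values; check the library's README for the exact option names):

# sketch: spark-redshift unloads the data to the S3 tempdir and reads it in parallel
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://host:5439/dbname?user=yourusername&password=secret") \
    .option("dbtable", "schema.table") \
    .option("tempdir", "s3n://your-bucket/tmp/") \
    .load()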

If you still want to use JDBC, check out the new built-in JDBC datasource in Spark 1.4+.
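
In Spark 1.4+ that replaces the older sqlContext.load call with the DataFrameReader API; a minimal sketch with the same placeholder connection details as in the first answer:

# sketch of the Spark 1.4+ built-in JDBC data source
df = sqlContext.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:port/dbname?user=yourusername&password=secret") \
    .option("dbtable", "schema.table") \
    .load()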



Disclosure: I am one of the authors of spark-redshift.

+4


source


First you need to download the Postgres JDBC driver. You can find it here: https://jdbc.postgresql.org/

You can define your SPARK_CLASSPATH environment variable in .bashrc, in conf/spark-env.sh, or in a similar file, or export it in a script before starting your IPython notebook.

You can also define it in conf/spark-defaults.conf like this:

spark.driver.extraClassPath  /path/to/file/postgresql-9.4-1201.jdbc41.jar


Make sure it appears in the Environment tab of your Spark WebUI.

You will also need to set the appropriate AWS credentials as follows:

// Scala: make the S3 (s3n) credentials available to Hadoop's S3 filesystem
sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "***")
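
The snippet above is Scala; in PySpark the same settings can be applied through the underlying Java context, roughly like this (a sketch: _jsc is an internal handle, and the bucket path is hypothetical):

# rough PySpark equivalent of the Scala snippet above
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "***")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "***")

# with the credentials in place, S3 data can be read as usual (hypothetical path)
lines = sc.textFile("s3n://your-bucket/path/to/data/*.gz")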


+2


source


While this is a very old post, for anyone still looking for an answer, the steps below worked for me!

Start the shell, including the JDBC driver jar:

bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar


Create a DataFrame with the appropriate connection details:

# note: a jdbc:redshift:// URL needs Amazon's Redshift JDBC driver; with only the
# PostgreSQL driver on the classpath, use jdbc:postgresql://host:port/db_name instead
myDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:redshift://host:port/db_name") \
    .option("dbtable", "table_name") \
    .option("user", "user_name") \
    .option("password", "password") \
    .load()
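
Once loaded, the DataFrame can be inspected or queried like any other; for example (a quick usage sketch, view name arbitrary):

# basic checks on the resulting DataFrame
myDF.printSchema()
myDF.show(5)

# or register it and query it with SQL
myDF.createOrReplaceTempView("redshift_table")
spark.sql("SELECT COUNT(*) FROM redshift_table").show()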


Spark: 2.2

+2


source






