How do I connect to Amazon Redshift or another database in Apache Spark?

I am trying to connect to Amazon Redshift via Spark, so I can merge data on S3 with our RS cluster data. I found some very spartan documentation here about JDBC connectivity:

https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases

The load command looks pretty straightforward (although I don't know how I would enter my AWS credentials here, perhaps in the settings?).

df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")


And I'm not really sure how to handle the SPARK_CLASSPATH variable. I am currently running Spark locally through an IPython notebook (as part of the Spark distribution). Where do I set it so that Spark picks it up?

Anyway, while trying to run these commands, I keep getting a bunch of cryptic errors, so I'm stuck for now. Any help or pointers to detailed guides would be appreciated.

+3


source


4 answers


It turns out that you only need a username/password to access Redshift from Spark, and it is done like this (using the Python API):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# the PostgreSQL JDBC driver still needs to be on the Spark classpath (see the other answers)
df = sqlContext.load(source="jdbc",
                     url="jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret",
                     dbtable="schema.table")
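
Once it loads, df is an ordinary DataFrame; a quick sanity check could look like this (assuming the placeholder connection details above are filled in):

# quick sanity check on the freshly loaded DataFrame
df.printSchema()
print(df.count())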




Hope this helps someone!

+4


source


If you are using Spark 1.4.0 or newer, take a look at spark-redshift, a library that supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you are querying large amounts of data, this approach should perform better than plain JDBC because it can unload and query the data in parallel.
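
For reference, reading from Redshift with spark-redshift looks roughly like this (a sketch with hypothetical host, table, and S3 tempdir values; check the library's README for the exact option names):

# sketch: spark-redshift unloads the data to the S3 tempdir and reads it in parallel
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://host:5439/dbname?user=yourusername&password=secret") \
    .option("dbtable", "schema.table") \
    .option("tempdir", "s3n://your-bucket/tmp/") \
    .load()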

If you still want to use JDBC, check out the new built-in JDBC datasource in Spark 1.4+.
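
In Spark 1.4+ that replaces the older sqlContext.load call with the DataFrameReader API; a minimal sketch with the same placeholder connection details as in the first answer:

# sketch of the Spark 1.4+ built-in JDBC data source
df = sqlContext.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:port/dbname?user=yourusername&password=secret") \
    .option("dbtable", "schema.table") \
    .load()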



Disclosure: I am one of the authors of spark-redshift.

+4


source


First you need to download the Postgres JDBC driver. You can find it here: https://jdbc.postgresql.org/

You can define your SPARK_CLASSPATH environment variable in .bashrc, in conf/spark-env.sh, or in a similar file, or export it in a script before starting your IPython notebook.

You can also define it in conf/spark-defaults.conf like this:

spark.driver.extraClassPath  /path/to/file/postgresql-9.4-1201.jdbc41.jar


Make sure it appears in the Environment tab of your Spark WebUI.

You will also need to set the appropriate AWS credentials as follows:

// Scala: make the S3 (s3n) credentials available to Hadoop's S3 filesystem
sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "***")
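
The snippet above is Scala; in PySpark the same settings can be applied through the underlying Java context, roughly like this (a sketch: _jsc is an internal handle, and the bucket path is hypothetical):

# rough PySpark equivalent of the Scala snippet above
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "***")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "***")

# with the credentials in place, S3 data can be read as usual (hypothetical path)
lines = sc.textFile("s3n://your-bucket/path/to/data/*.gz")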


+2


source


While this is a very old post, for anyone still looking for an answer, the steps below worked for me!

Start the shell, including the JDBC driver jar:

bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar


Create a DataFrame with the appropriate connection details:

# note: a jdbc:redshift:// URL needs Amazon's Redshift JDBC driver; with only the
# PostgreSQL driver on the classpath, use jdbc:postgresql://host:port/db_name instead
myDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:redshift://host:port/db_name") \
    .option("dbtable", "table_name") \
    .option("user", "user_name") \
    .option("password", "password") \
    .load()
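
Once loaded, the DataFrame can be inspected or queried like any other; for example (a quick usage sketch, view name arbitrary):

# basic checks on the resulting DataFrame
myDF.printSchema()
myDF.show(5)

# or register it and query it with SQL
myDF.createOrReplaceTempView("redshift_table")
spark.sql("SELECT COUNT(*) FROM redshift_table").show()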


Spark: 2.2

+2


source






