Processing data stored in Redshift

We are currently using Redshift as our data store and we are very happy with it. However, we now have a requirement to do machine learning against the data in our warehouse. Given the amount of data, I would ideally like to perform the computation where the data lives rather than move it around, but that is not possible with Redshift. I have looked at MADlib, but it is not an option because Redshift does not support UDFs (which MADlib requires). I am currently looking at migrating the data to EMR and processing it with the Apache Spark machine learning library (or perhaps H2O, or Mahout, or something else). So my questions are:

  • Is there a better way?
  • If not, how do I make the data available to Spark? The options I've identified so far are: using Sqoop to load it into HDFS, using DBInputFormat, or exporting from Redshift to S3 and running Spark against S3 directly. What are the advantages and disadvantages of these approaches (and any others) when using Spark?

Please note that this is offline batch training, but we would like it to run as quickly as possible so that we can experiment and iterate quickly.

+3




2 answers


If you want to query Redshift data in Spark, and you are using Spark 1.4.0 or newer, take a look at spark-redshift, which supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you are querying large amounts of data, this approach should perform better than plain JDBC because it unloads and reads the data in parallel. If you plan on running many different ML jobs against your Redshift data, consider using spark-redshift to export it from Redshift once and save it to S3 in an efficient file format such as Parquet.
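For illustration, here is a minimal sketch (Scala, Spark 1.4 DataFrame API) of reading a Redshift table with spark-redshift and caching it on S3 as Parquet. The JDBC URL, table name, and S3 paths are placeholders, not values from the question:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object RedshiftToParquet {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("redshift-to-parquet"))
        val sqlContext = new SQLContext(sc)

        // Load a Redshift table into a DataFrame. spark-redshift unloads the
        // data to the S3 tempdir and reads it back in parallel.
        val df = sqlContext.read
          .format("com.databricks.spark.redshift")
          .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass") // placeholder
          .option("dbtable", "my_table")                                         // placeholder
          .option("tempdir", "s3n://my-bucket/tmp/")                             // placeholder
          .load()

        // Persist once in a columnar format so repeated ML experiments read
        // from S3 instead of hitting Redshift every time.
        df.write.parquet("s3n://my-bucket/ml/my_table.parquet")                  // placeholder
      }
    }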



Disclosure: I am one of the authors of spark-redshift.

+1




You can run Spark alongside an existing Hadoop cluster by simply running it as a separate service on the same machines. To access Hadoop data from Spark, just use an hdfs:// URL (typically hdfs://<namenode-host>:9000/path; you can find the correct URL on the Hadoop NameNode's web interface). Alternatively, you can set up a separate cluster for Spark and still access HDFS over the network; this will be slower than local disk access, but may not be a problem if you stay on the same local network (for example, by placing a few Spark machines on each rack that holds Hadoop nodes).

To move data from Redshift to HDFS you can use the AWS Data Pipeline service, or simply Redshift's UNLOAD command, which exports to S3, from where the files can be copied into HDFS. Whether you can do the machine learning against Redshift itself depends on the tool or algorithm implementation you choose. Either way, Redshift is less a database and more a data warehouse, with all the pros and cons that entails.
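As a small illustration, assuming the data has already landed in HDFS (for example via Data Pipeline or an UNLOAD-and-copy step), a Spark job would read it through an hdfs:// URL like the sketch below; the NameNode host, port, and path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object ReadFromHdfs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("read-from-hdfs"))

        // Point Spark at the HDFS URL reported by the NameNode web UI.
        val lines = sc.textFile("hdfs://namenode-host:9000/data/redshift_export/*") // placeholder

        // A trivial sanity check before handing the data to MLlib / H2O / Mahout.
        println(s"rows: ${lines.count()}")
        sc.stop()
      }
    }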



0








