RPostgreSQL - Connecting R to Amazon Redshift - How to WRITE / Post Bigger Data Sets

I am experimenting with how to connect R to Amazon Redshift, and will post a short blog for other newbies.

Some good progress - I can do most things (create tables, fetch data, and even sqlSave or dbSendQuery "row by row"). HOWEVER, I haven't found a way to BULK UPLOAD a table in one go (e.g. copy the whole 150x5 IRIS table / data frame into Redshift) so that it takes less than a minute.

Q: Any advice for a new RPostgreSQL user on how to write / upload a block of data to Redshift would be greatly appreciated!

RODBC:

colnames(iris) <- tolower(colnames(iris))   # Redshift prefers lower-case column names
sqlSave(channel, iris, "iris", rownames = FALSE)   # creates the table and INSERTs the rows one by one


SLOOOOOOW! SO SLOW! There must be a better way - 150 rows takes ~1.5 minutes.

iris_results <- sqlQuery(channel,"select * from iris where species = 'virginica'") # fast subset. this does work and shows up on AWS Redshift Dashboard

sqlDrop(channel, "iris", errors = FALSE) # clean up our toys


RPostgreSQL:

dbSendQuery(con, "create table iris_200 (sepallength float,sepalwidth float,petallength float,petalwidth float,species VARCHAR(100));")
dbListFields(con,"iris_200")


Insert rows into the table ONE BY ONE:

dbSendQuery(con, "insert into iris_200 values(5.1,3.5,1.4,0.2,'Iris-setosa');")

dbSendQuery(con, "insert into iris_200 values(5.5,2.5,1.1,0.4,'Iris-setosa');")

dbSendQuery(con, "insert into iris_200 values(5.2,3.3,1.2,0.3,'Iris-setosa');")

dframe <- dbReadTable(con, "iris_200") # ok

dbRemoveTable(con,"iris_200")  # and clean up toys


or loop over the table (takes about 1 row per second):

# iris_200 here is assumed to be a local data frame with the same five columns
# as the Redshift table (e.g. iris_200 <- iris after lower-casing the column names)
for (i in 1:nrow(iris_200)) {
  query <- paste("insert into iris_200 values(",
                 iris_200[i, 1], ",", iris_200[i, 2], ",",
                 iris_200[i, 3], ",", iris_200[i, 4], ",",
                 "'", iris_200[i, 5], "'", ");", sep = "")

  print(paste("row", i, "loading data >>  ", query))

  dbSendQuery(con, query)
}
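
A somewhat faster stop-gap (still not true bulk loading - this is just a sketch that batches the same values, assuming iris_200 is the local data frame from above) is to send one multi-row INSERT instead of one statement per row:

# Build a single "insert into ... values (...),(...),..." statement from the local data frame.
# Note: Redshift limits a single statement to 16 MB, so very large frames would need chunking.
values <- character(nrow(iris_200))
for (i in 1:nrow(iris_200)) {
  values[i] <- paste0("(", iris_200[i, 1], ",", iris_200[i, 2], ",",
                      iris_200[i, 3], ",", iris_200[i, 4], ",'", iris_200[i, 5], "')")
}
query <- paste0("insert into iris_200 values ", paste(values, collapse = ","), ";")
dbSendQuery(con, query)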


So, in short, this is a hacky / slow way - any advice on how to bulk load / insert data would be appreciated - thanks!!

The complete code is here:

PS - got this error message: LOAD source is not supported. (Hint: Only S3 or DynamoDB or EMR based load is allowed)


Update 6/12/2015 - Direct upload of bulk data at a reasonable speed may not be possible, judging by the error message above and by this blog post (see the section on loading data): http://dailytechnology.net/2013/08/03/redshift-what-you-need-to-know/

He notes:

So now that we have created the data structure, how do we get the data into it? You have two options: 1) Amazon S3, 2) DynamoDB. (Yes, you could just run a series of INSERT statements, but that would be painfully slow.)

Amazon recommends using the S3 method, which I will describe briefly. I don't see DynamoDB being particularly useful unless you're already using it and want to migrate some of your data to Redshift.

To get data from your local network to S3 .....
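
For reference, a minimal sketch of that CSV -> S3 -> COPY route from R might look like the following (the bucket name, file name and credentials are placeholders, and the COPY assumes the cluster has been granted access to that bucket via keys or an IAM role):

library(aws.s3)        # for put_object() / delete_object()

# 1. Write the data frame to a local CSV without a header (COPY maps columns by position)
write.table(iris, "iris.csv", sep = ",", row.names = FALSE, col.names = FALSE)

# 2. Upload the file to S3 ("mybucket" is a placeholder)
put_object(file = "iris.csv", object = "iris.csv", bucket = "mybucket")

# 3. Ask Redshift to COPY from S3 (placeholder credentials; an IAM role ARN works too)
dbSendQuery(con, "copy iris_200 from 's3://mybucket/iris.csv'
                  credentials 'aws_access_key_id=MY_KEY;aws_secret_access_key=MY_SECRET'
                  csv;")

# 4. Clean up the temporary file on S3
delete_object(object = "iris.csv", bucket = "mybucket")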

RA: Will post updates if I figure it out



1 answer


Maybe too late for the OP, but I'll post this here for future reference in case someone else runs into the same problem:

To do a bulk upload, follow these steps:

  • Create a table in Redshift with the same structure as the data frame
  • Split the data into N parts
  • Convert the parts into a format readable by Redshift
  • Upload all the parts to Amazon S3
  • Run the COPY statement in Redshift
  • Delete the temporary files from Amazon S3

I've created an R package that does exactly that, except for the first step, and it's called redshiftTools: https://github.com/sicarul/redshiftTools

To install the package, you will need:



install.packages('devtools')
devtools::install_github("RcppCore/Rcpp")
devtools::install_github("rstats-db/DBI")
devtools::install_github("rstats-db/RPostgres")
devtools::install_github("hadley/xml2")
install.packages("aws.s3", repos = c(getOption("repos"), "http://cloudyr.github.io/drat"))
devtools::install_github("sicarul/redshiftTools")


Subsequently, you can use it like this:

library("aws.s3")
library(RPostgres)
library(redshiftTools)

con <- dbConnect(RPostgres::Postgres(), dbname = "dbname",
                 host = 'my-redshift-url.amazon.com', port = '5439',
                 user = 'myuser', password = 'mypassword', sslmode = 'require')

rs_replace_table(my_data, dbcon=con, tableName='mytable', bucket="mybucket")
rs_upsert_table(my_other_data, dbcon=con, tableName = 'mytable', bucket="mybucket", keys=c('id', 'date'))


rs_replace_table truncates the target table and then loads it entirely from the data frame; only do this if you don't care about the data the table currently holds. rs_upsert_table, on the other hand, replaces rows that have matching keys and inserts those that do not yet exist in the table.
