Reading an ORC file from S3 into R

We will be hosting an EMR cluster on AWS (exact instance types still to be decided), running on top of an S3 bucket. The data will be stored in that bucket in ORC format. However, we also want to read the same data from R in a sandbox environment.

I have the aws.s3 (cloudyr) package working correctly: I can read CSV files without issue, but I can't seem to convert the ORC files into anything readable.
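For reference, a minimal sketch of the CSV path that works for me (bucket and object names below are placeholders):

Sys.setenv("AWS_ACCESS_KEY_ID" = "myKeyID",
           "AWS_SECRET_ACCESS_KEY" = "myKey",
           "AWS_DEFAULT_REGION" = "myRegion")

library(aws.s3)

# s3read_using() fetches the object to a temporary file and hands it to any reader function
df <- s3read_using(FUN = read.csv, object = "s3://mybucket/myfile.csv")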

Two options I found online:

- SparkR
- dataconnector (Vertica)

Since installing dataconnector on a Windows machine was troublesome, I installed SparkR instead, and I can now read a local ORC file (R running on my machine, ORC file also on my machine). However, if I try read.orc on an S3 path, it normalizes the path to a local path by default. So, digging into the source code, I ran the following:

# Reproduce read.orc() while skipping the normalizePath() call it applies to the path
sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
read <- SparkR:::callJMethod(sparkSession, "read")
read <- SparkR:::callJMethod(read, "options", options)
sdf <- SparkR:::handledCallJMethod(read, "orc", my_path)

But I got the following error:

Error: Error in orc : java.io.IOException: No FileSystem for scheme: https

Can anyone help me with this problem, or point out an alternative way to read ORC files from S3?


1 answer


Edited answer: you can now read directly from S3 instead of first downloading the file and then reading it from the local filesystem. The java.io.IOException: No FileSystem for scheme: https also makes sense now: Spark was handed an https:// object URL, but Hadoop can only resolve schemes it has a filesystem implementation for, such as s3n:// (or s3a://) once the hadoop-aws package is on the classpath.

As requested by mrjoseph: a possible solution via SparkR (which I didn't want to use in the first place).



# Set the System environment variable to where Spark is installed
Sys.setenv(SPARK_HOME="pathToSpark")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "org.apache.hadoop:hadoop-aws:2.7.1" "sparkr-shell"')

# Set the library path to include path to SparkR
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))

# Set AWS credentials in the environment so S3 can be reached
Sys.setenv("AWS_ACCESS_KEY_ID" = "myKeyID",
           "AWS_SECRET_ACCESS_KEY" = "myKey",
           "AWS_DEFAULT_REGION" = "myRegion")

# load required packages
library(aws.s3)
library(SparkR)

# Create a Spark context and a SQL context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Set path to file
path <- "s3n://bucketname/filename.orc"

# Point the Hadoop configuration at the native S3 filesystem
hConf <- SparkR:::callJMethod(sc, "hadoopConfiguration")
SparkR:::callJMethod(hConf, "set", "fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
SparkR:::callJMethod(hConf, "set", "fs.s3n.awsAccessKeyId", "myAccessKey")
SparkR:::callJMethod(hConf, "set", "fs.s3n.awsSecretAccessKey", "mySecretKey")

# Slight adaptation of the read.orc function: skip its normalizePath() call,
# which would mangle the s3n:// URI
sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
read <- SparkR:::callJMethod(sparkSession, "read")
read <- SparkR:::callJMethod(read, "options", options)
sdf <- SparkR:::handledCallJMethod(read, "orc", path)
temp <- SparkR:::dataFrame(sdf)

# Show the first rows
head(temp)

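A footnote on newer SparkR versions (a sketch, untested here): sparkR.session() replaces sparkR.init(), and read.df() accepts the source format as an argument without running normalizePath() on the path, so the s3a:// filesystem can be used directly. The hadoop-aws version below is an assumption and must match your Hadoop build:

library(SparkR)

# Start a session and pull in the S3 filesystem implementation
sparkR.session(master = "local",
               sparkPackages = "org.apache.hadoop:hadoop-aws:2.7.1")

# s3a reads credentials from the AWS_* environment variables set earlier;
# read.df() passes the URI through untouched, avoiding the read.orc() issue
df <- read.df("s3a://bucketname/filename.orc", source = "orc")
head(df)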
