How do you use s3a with spark 2.1.0 on aws us-east-2?

Background

I am working on a flexible setup for running Spark on AWS in Docker. The Docker image I use is set to the latest Spark, which at the time of writing is 2.1 with Hadoop 2.7.3, and is available as jupyter/pyspark-notebook.

This works, and I have been testing the various connection paths I plan to use. The problem I am facing is uncertainty about how to interact with S3 correctly. I followed guidance on how to supply the Spark dependencies needed to connect to AWS S3 using the s3a protocol, as opposed to the s3n protocol.

Finally I came across the hadoop-aws guide and thought I was following its configuration instructions. However, I was still getting a 400 Bad Request error, as shown in this question, which describes fixing it by defining the endpoint, which I had already done.

While on us-east-2 I had strayed far enough from the default configuration that I could not tell whether my problem was with the jar files or with the region. To isolate the region issue, I pointed everything at the regular region us-east-1, and I was finally able to connect over s3a. So I narrowed the problem down to the region, even though I thought I had done everything required to work in the other region.

Question

What is the correct way to set the Hadoop config variables in Spark so that it works with us-east-2?

Note. This example uses local execution mode for simplicity.

import os
import pyspark


I can see these packages load in the notebook console after creating the context, and adding them still leaves me with the Bad Request error.

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

conf = pyspark.SparkConf().setMaster('local[1]')
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)


For the AWS config, I tried both the method below and just using the conf above, following the pattern conf.set('spark.hadoop.fs.<config_string>', <config_value>), which is equivalent to what I do below except that the values are set on conf before the Spark context is created.
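For reference, a minimal sketch of that SparkConf-based variant, assuming access_id and access_key are defined as in the snippets below; the spark.hadoop. prefix is how Spark forwards options into the Hadoop configuration, so the option names simply mirror the hadoopConfiguration() calls shown next.

# Sketch: same s3a settings applied on the SparkConf before the context exists.
# access_id / access_key are assumed to be defined elsewhere, as in the post.
conf = pyspark.SparkConf().setMaster('local[1]')
conf.set('spark.hadoop.fs.s3.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
conf.set('spark.hadoop.fs.s3a.endpoint', 's3.us-east-2.amazonaws.com')
conf.set('spark.hadoop.fs.s3a.access.key', access_id)
conf.set('spark.hadoop.fs.s3a.secret.key', access_key)
sc = pyspark.SparkContext(conf=conf)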

hadoop_conf = sc._jsc.hadoopConfiguration()

hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)


It should be noted that I also tried an alternate endpoint for us-east-2, s3-us-east-2.amazonaws.com.

Then I read the parquet data from s3.

df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()


Again, after moving the EC2 instance to us-east-1 and commenting out the endpoint config, this works. Apparently that config is not being picked up for some reason.
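One way to check whether the values are actually landing in the Hadoop configuration the context uses is to read them back. This is only a diagnostic sketch, not part of the original attempts:

# Diagnostic sketch: read back the Hadoop configuration the running context holds,
# to confirm whether the endpoint/key settings above were actually applied.
hadoop_conf = sc._jsc.hadoopConfiguration()
print(hadoop_conf.get('fs.s3a.endpoint'))   # expect s3.us-east-2.amazonaws.com
print(hadoop_conf.get('fs.s3.impl'))        # expect org.apache.hadoop.fs.s3a.S3AFileSystem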

1 answer


us-east-2 is a V4-auth-only S3 region, so fs.s3a.endpoint does need to be set, as you are attempting.

If it is not being picked up, then assume the config you are setting is not the one being used to access the bucket. Be aware that Hadoop caches filesystem instances by URI, even when the config changes: the first attempt to access a filesystem pins the config it uses, even when that config contains no credentials.

Some tactics



  • set the value in spark-defaults.conf
  • using the config you have just created, try to load the filesystem explicitly with a call to FileSystem.get(new URI("s3a://bucket-name/parquet-data-name"), myConf), which returns a filesystem with that config (unless one already exists). I don't know how to make that call from .py though; a possible PySpark sketch follows this list.
  • set the property fs.s3a.impl.disable.cache to true to bypass the cache before the get call

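As a hedged sketch of how that might look from PySpark (not something tested against the asker's setup): the JVM classes can be reached through the py4j gateway the SparkContext exposes, and the property names are the ones mentioned above.

# Sketch only: force a fresh S3A filesystem binding with the current Hadoop config,
# using the JVM gateway PySpark exposes. The bucket name is the asker's placeholder.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl.disable.cache", "true")   # bypass the cached instance
uri = sc._jvm.java.net.URI("s3a://bucket-name/parquet-data-name")
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(uri, hadoop_conf)
print(fs.getUri().toString())   # reports the bucket this filesystem object is bound to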
Adding extra diagnostics for BadAuth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add that, along with a test, I could review it and get it in.
