Running EMR Spark with multiple S3 accounts

I have an EMR Spark job that needs to read data from S3 in one account and write it to a bucket in another account.
I have divided the work into two stages.

  1. Read data from S3 into the cluster's local HDFS (no credentials required, as my EMR cluster is in the same account).

  2. Read the data from the local HDFS written in step 1 and write it to the S3 bucket in the other account (a sketch of both stages follows this list).
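
Roughly, the two stages look like the following sketch (bucket names and paths are placeholders, not my real ones):

import org.apache.spark.sql.SparkSession

object TwoStageCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TwoStageCopy").getOrCreate()

    // Stage 1: read from the bucket in my own account and stage the data in HDFS.
    // No explicit credentials are needed; the EMR instance role covers this bucket.
    spark.read.parquet("s3://my-account-bucket/input/")
      .write.mode("overwrite").parquet("hdfs:///staging/input/")

    // Stage 2: read the staged data back from HDFS and write it to the bucket in
    // the other account. This is the step that fails with Access Denied.
    spark.read.parquet("hdfs:///staging/input/")
      .write.mode("overwrite").parquet("s3://other-account-bucket/output/")

    spark.stop()
  }
}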

I tried setting the credentials in hadoopConfiguration:

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<your access key>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","<your secretkey>")

      

And exporting keys in the cluster:

$ export AWS_SECRET_ACCESS_KEY=
$ export AWS_ACCESS_KEY_ID=

      

I tried this in both cluster and client deploy modes, as well as in spark-shell, without success.

Each of them returns an error:

ERROR ApplicationMaster: User class threw exception: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: 
Access Denied

      

+9




3 answers


The solution is actually quite simple.

First, EMR clusters have two roles:

  • A service role (EMR_DefaultRole) that grants permissions to the EMR service (for example, to launch Amazon EC2 instances)
  • An EC2 role (EMR_EC2_DefaultRole) attached to the EC2 instances running in the cluster, giving them access to AWS credentials (see Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances)

These roles are explained in: Default IAM Roles for Amazon EMR

Therefore, each EC2 instance running in the cluster is assigned the EMR_EC2_DefaultRole role, which provides temporary credentials through the instance metadata service. (For an explanation of how this works, see IAM Roles for Amazon EC2.) Amazon EMR nodes use these credentials to access AWS services such as S3, SNS, SQS, CloudWatch, and DynamoDB.
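
As a quick way to see this in action, the sketch below resolves the instance-profile credentials directly; it assumes the AWS SDK for Java v1 is on the classpath (it normally is on EMR nodes):

import com.amazonaws.auth.InstanceProfileCredentialsProvider

object ShowInstanceRoleCredentials {
  def main(args: Array[String]): Unit = {
    // Resolve the temporary credentials that the attached EMR_EC2_DefaultRole
    // exposes through the EC2 instance metadata service.
    val credentials = InstanceProfileCredentialsProvider.getInstance().getCredentials
    // Print only the temporary access key id, never the secret key.
    println(s"Access key id from instance profile: ${credentials.getAWSAccessKeyId}")
  }
}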



Second, you will need to add permissions on the Amazon S3 bucket in the other account to allow access via the EMR_EC2_DefaultRole role. This can be done by adding a bucket policy to the S3 bucket (referred to here as other-account-bucket) as follows:

{
    "Id": "Policy1",
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1",
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::other-account-bucket",
                "arn:aws:s3:::other-account-bucket/*"
            ],
            "Principal": {
                "AWS": [
                    "arn:aws:iam::ACCOUNT-NUMBER:role/EMR_EC2_DefaultRole"
                ]
            }
        }
    ]
}

      

This policy grants all S3 permissions (s3:*) to the EMR_EC2_DefaultRole role belonging to the account matching ACCOUNT-NUMBER in the policy, which must be the account in which the EMR cluster was launched. Be careful when granting such broad permissions; you may wish to grant only GetObject rather than all S3 actions.

That's it! The bucket in the other account will now accept requests from the EMR nodes because they are using the EMR_EC2_DefaultRole role.
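
For example, once that bucket policy is in place, a write like the following (paths are placeholders) should work from spark-shell on the cluster without setting any access keys in hadoopConfiguration, because the instance role supplies the credentials:

// No fs.s3n/fs.s3a access keys are set; the EMR_EC2_DefaultRole credentials are used.
spark.read.parquet("hdfs:///staging/input/")
  .write.mode("overwrite").parquet("s3://other-account-bucket/output/")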

Disclaimer: I tested the above by creating a bucket in Account-A and granting permissions (as shown above) to a role in Account-B. An EC2 instance launched in Account-B with that role was able to access the bucket using the AWS CLI. I haven't tested it with EMR, but it should work the same way.

+14




I believe you need to assign an IAM role to your compute nodes (you may have done that already) and then grant that role cross-account access via IAM in the "remote" account. See http://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html for details.



0




With Spark you can also use an assumed role to access an S3 bucket in a different account, using an IAM role defined in that other account. This makes it easier for the other account's owner to manage the permissions granted to the Spark job. Managing access through S3 bucket policies can be difficult, because the access rights end up spread across multiple locations rather than being contained in a single IAM role.

Here is the hadoopConfiguration:

"fs.s3a.credentialsType" -> "AssumeRole",
"fs.s3a.stsAssumeRole.arn" -> "arn:aws:iam::<<AWSAccount>>:role/<<crossaccount-role>>",
"fs.s3a.impl" -> "com.databricks.s3a.S3AFileSystem",
"spark.hadoop.fs.s3a.server-side-encryption-algorithm" -> "aws:kms",
"spark.hadoop.fs.s3a.server-side-encryption-kms-master-key-id" -> "arn:aws:kms:ap-southeast-2:<<AWSAccount>>:key/<<KMS Key ID>>"

      

An external ID can also be used as a passphrase:

"spark.hadoop.fs.s3a.stsAssumeRole.externalId" -> "GUID created by other account owner"

      

We used Databricks for the above; we haven't tried EMR yet.
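
For completeness, here is a sketch of applying these settings programmatically before the write. It is untested on EMR and assumes the com.databricks.s3a.S3AFileSystem implementation referenced above is on the cluster's classpath; the role ARN, external ID, and bucket name are placeholders:

import org.apache.spark.sql.SparkSession

object AssumeRoleWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AssumeRoleWrite").getOrCreate()
    val hadoopConf = spark.sparkContext.hadoopConfiguration

    // Same settings as above, applied to the Hadoop configuration at runtime.
    hadoopConf.set("fs.s3a.impl", "com.databricks.s3a.S3AFileSystem")
    hadoopConf.set("fs.s3a.credentialsType", "AssumeRole")
    hadoopConf.set("fs.s3a.stsAssumeRole.arn", "arn:aws:iam::<<AWSAccount>>:role/<<crossaccount-role>>")
    // Optional external ID agreed with the other account's owner.
    hadoopConf.set("fs.s3a.stsAssumeRole.externalId", "<<external id>>")

    // Writes through s3a:// now go via the assumed cross-account role.
    spark.range(10).write.mode("overwrite").parquet("s3a://other-account-bucket/output/")

    spark.stop()
  }
}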

0








