How to read a CSV file from an S3 bucket using pandas in Python

I am trying to read a CSV file located in an AWS S3 bucket into memory as a pandas DataFrame using the following code:

import pandas as pd
import boto

data = pd.read_csv('s3:/example_bucket.s3-website-ap-southeast-2.amazonaws.com/data_1.csv')

      

To give full access, I set the following bucket policy on the S3 bucket:

{
  "Version": "2012-10-17",
  "Id": "statement1",
  "Statement": [
    {
      "Sid": "statement1",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::example_bucket"
    }
  ]
}

Unfortunately, I am still getting the following error in Python:

boto.exception.S3ResponseError: S3ResponseError: 405 Method Not Allowed

      

I wonder if anyone can explain how to properly set permissions in AWS S3 or configure pandas to import the file correctly. Thanks!



3 answers


Using pandas 0.20.3



import os
import boto3
import pandas as pd
import sys

if sys.version_info[0] < 3: 
    from StringIO import StringIO # Python 2.x
else:
    from io import StringIO # Python 3.x

# get your credentials from environment variables
aws_id = os.environ['AWS_ID']
aws_secret = os.environ['AWS_SECRET']

client = boto3.client('s3', aws_access_key_id=aws_id,
        aws_secret_access_key=aws_secret)

bucket_name = 'my_bucket'

object_key = 'my_file.csv'
csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')

df = pd.read_csv(StringIO(csv_string))
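
As an aside, if the optional s3fs package is installed, pandas can also read the object straight from an s3:// URL without calling boto3 by hand. A minimal sketch, assuming s3fs is available and credentials are supplied via the usual environment variables or ~/.aws/credentials (bucket and key names are the same placeholders as above):

import pandas as pd

# pandas delegates s3:// URLs to s3fs, which picks up credentials from the environment
df = pd.read_csv('s3://my_bucket/my_file.csv')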

      



Eventually I figured out that you also need to set permissions on each individual object within the bucket in order to retrieve it, which I did with the following code:

import boto
from boto.s3.key import Key

# connect and open the bucket (credentials come from the environment or boto config)
conn = boto.connect_s3()
bucket = conn.get_bucket('example_bucket')

k = Key(bucket)
k.key = 'data_1.csv'
k.set_canned_acl('public-read')

      



I also had to change the bucket address in the pd.read_csv call to the region-specific endpoint, like this:

data = pd.read_csv('https://s3-ap-southeast-2.amazonaws.com/example_bucket/data_1.csv')
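
For anyone using boto3 rather than the legacy boto package, a rough equivalent of the object-level ACL change is sketched below (bucket and key names are the ones from the question; credentials are assumed to be configured already):

import boto3

s3 = boto3.client('s3')
# apply the 'public-read' canned ACL to the individual object
s3.put_object_acl(ACL='public-read', Bucket='example_bucket', Key='data_1.csv')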

      



You don't need pandas; you can just use Python's built-in csv library:

import csv

import boto
import boto.s3
from boto.s3.key import Key


def read_file(bucket_name, region, remote_file_name, aws_access_key_id, aws_secret_access_key):
    # reads a CSV file from an S3 bucket and returns its rows as a list of lists

    # first, establish a connection with your credentials and region id
    conn = boto.s3.connect_to_region(
        region,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key)

    # next, obtain the key of the CSV you want to read;
    # you will need the bucket name and the CSV file name
    bucket = conn.get_bucket(bucket_name, validate=False)
    key = Key(bucket)
    key.key = remote_file_name
    data = key.get_contents_as_string()
    key.close()

    # the contents come back as one string, so you need to split it into lines;
    # the line separator is usually '\r\n'; if not, inspect the file and find out what it is
    reader = csv.reader(data.split('\r\n'))

    rows = []
    header = next(reader)  # skip the header row
    for row in reader:
        rows.append(row)

    return rows
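
For illustration, a call might look like this (all argument values are placeholders); each returned row is a list of strings, and the header row is consumed inside the function:

rows = read_file('example_bucket', 'ap-southeast-2', 'data_1.csv',
                 aws_access_key_id='YOUR_KEY_ID', aws_secret_access_key='YOUR_SECRET')
print(rows[0])  # first data row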

      

Hope this solves your problem. Good luck! :)







