Firehose output to S3 - partitioning from the hourly YYYY/MM/DD/HH prefix format into day-level dt=YYYY-MM-DD format
3 answers
Adding the same answer in boto3 (to match the current default Lambda packaging):
import re
import boto3

## set buckets
source_bucket = 'walla-anagog-us-east-1'
destination_bucket = 'walla-anagog-eu-west-1'

## regex renames keys from YYYY/MM/DD/HH to dt=YYYY-MM-DD (the hour is dropped)
client = boto3.client('s3')
s3 = boto3.resource('s3')
mybucket = s3.Bucket(source_bucket)

for obj in mybucket.objects.all():
    replaced_key = re.sub(r'(\d{4})/(\d{2})/(\d{2})/(\d{2})', r'dt=\1-\2-\3', obj.key)
    print(obj.key)
    # copy to the destination bucket under the new key, then delete the source
    client.copy_object(Bucket=destination_bucket,
                       CopySource=source_bucket + "/" + obj.key,
                       Key=replaced_key,
                       ServerSideEncryption='AES256')
    client.delete_object(Bucket=source_bucket, Key=obj.key)
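As a quick sanity check of the substitution (the sample key below is made up), the regex maps an hourly Firehose-style prefix to a day-level partition and drops the hour component:

```python
import re

# Hypothetical Firehose-style key: YYYY/MM/DD/HH/<file>
key = "2017/01/02/03/delivery-stream-1-2017-01-02-03-14-22.gz"

# Same pattern as above: capture the date parts, drop the hour
replaced = re.sub(r'(\d{4})/(\d{2})/(\d{2})/(\d{2})', r'dt=\1-\2-\3', key)
print(replaced)  # dt=2017-01-02/delivery-stream-1-2017-01-02-03-14-22.gz
```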
We solved this problem using S3DistCp: we aggregate the data hourly, group the files by a pattern, and write them out to directories with the desired prefixes.
This is definitely a feature Firehose is missing, and there is currently no way to do it using Firehose alone.
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
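For reference, a sketch of the kind of S3DistCp step this describes, submitted to an EMR cluster. The bucket names, prefixes, and cluster ID are placeholders, and the exact `--groupBy` pattern depends on your key layout; `--targetSize` (in MiB) controls how large the aggregated output files are:

```shell
aws emr add-steps --cluster-id j-XXXXXXXX --steps \
  Type=CUSTOM_JAR,Name=S3DistCpHourly,Jar=command-runner.jar,\
Args=[s3-dist-cp,\
--src,s3://source-bucket/2017/01/02/,\
--dest,s3://dest-bucket/dt=2017-01-02/,\
--groupBy,'.*/(\d{2})/.*\.gz',\
--targetSize,128]
```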
I used Python and boto to move the files and repartition them. I applied a regex to rename keys from YYYY/MM/DD/HH to dt=YYYY-MM-DD-HH.
Code snippet (note that it also deletes the source key):
from boto.s3.connection import S3Connection
import re

conn = S3Connection('xxx', 'yyy')

## get buckets
source_bucket = 'srcBucketName'
destination_bucket = 'dstBucketName'
src = conn.get_bucket(source_bucket)
dst = conn.get_bucket(destination_bucket)

## iterate over the source keys
for key in src.list():
    file = key.name.encode('utf-8')
    replaced_file = re.sub(r'(\d{4})/(\d{2})/(\d{2})/(\d{2})', r'dt=\1-\2-\3-\4', file)
    # copy to the destination bucket under the new key, then delete the source
    dst.copy_key(replaced_file, src.name, file, encrypt_key=True)
    key.delete()
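Note that this variant keeps the hour in the partition value (dt=YYYY-MM-DD-HH rather than dt=YYYY-MM-DD). A quick check with a made-up key:

```python
import re

key = "2017/01/02/03/records.gz"  # hypothetical hourly key
replaced = re.sub(r'(\d{4})/(\d{2})/(\d{2})/(\d{2})', r'dt=\1-\2-\3-\4', key)
print(replaced)  # dt=2017-01-02-03/records.gz
```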