Firehose output to S3 - partitioning from year/month/day/hour format to dt=YY-MM-DD format

I work in the AWS EMR ecosystem.

I'm looking for a clever way to convert the AWS Firehose output layout:

s3://bucket/YYYY/MM/DD/HH

to the Hive partition format:

s3://bucket/dt=YY-MM-DD-HH

Any suggestions?

Thanks, Omid



3 answers


Here is the same answer using Boto3 (to match the current default Lambda packaging):



import re
import boto3

# Set buckets:
source_bucket = 'walla-anagog-us-east-1'
destination_bucket = 'walla-anagog-eu-west-1'

# Regex rewrites YYYY/MM/DD/HH to dt=YYYY-MM-DD (the hour is dropped,
# so all hours of a day land in the same partition).
date_path = re.compile(r'(\d{4})/(\d{2})/(\d{2})/(\d{2})')

client = boto3.client('s3')
s3 = boto3.resource('s3')
mybucket = s3.Bucket(source_bucket)

for obj in mybucket.objects.all():
    replaced_key = date_path.sub(r'dt=\1-\2-\3', obj.key)
    print(obj.key)
    # Copy to the new key, then delete the source object.
    client.copy_object(
        Bucket=destination_bucket,
        CopySource={'Bucket': source_bucket, 'Key': obj.key},
        Key=replaced_key,
        ServerSideEncryption='AES256',
    )
    client.delete_object(Bucket=source_bucket, Key=obj.key)
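Since this answer mentions Lambda packaging, here is a rough sketch of how the same copy could run per object from an S3 event notification instead of scanning the whole bucket. The handler name, the DEST_BUCKET environment variable, and the optional client parameter are assumptions made for illustration; they are not part of the original answer. Like the script above, it drops the hour and writes to dt=YYYY-MM-DD.

```python
import os
import re
from urllib.parse import unquote_plus

# Hypothetical setting: the destination bucket comes from an environment variable.
DEST_BUCKET = os.environ.get('DEST_BUCKET', 'my-destination-bucket')
DATE_PATH = re.compile(r'(\d{4})/(\d{2})/(\d{2})/(\d{2})')


def handler(event, context, client=None):
    """Copy each newly created object to its dt=YYYY-MM-DD key."""
    if client is None:
        import boto3  # imported lazily so the function can be exercised offline
        client = boto3.client('s3')
    copied = []
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record['s3']['object']['key'])
        new_key = DATE_PATH.sub(r'dt=\1-\2-\3', key, count=1)
        if new_key == key:
            continue  # no YYYY/MM/DD/HH prefix found; leave the object alone
        client.copy_object(
            Bucket=DEST_BUCKET,
            CopySource={'Bucket': bucket, 'Key': key},
            Key=new_key,
            ServerSideEncryption='AES256',
        )
        copied.append(new_key)
    return copied
```

The sketch only copies; deleting the source object, as the script above does, would be one extra delete_object call after a successful copy.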

      



We solved this problem using S3DistCp. We perform hourly aggregations of the data, group the files by pattern, and write them to directories with predefined prefixes.

This is definitely a feature that Firehose is missing and there is currently no way to do it using Firehose alone.



http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html



I used Python and boto to move the files and repartition them. I applied a regex to rename each key from YYYY/MM/DD/HH to dt=YYYY-MM-DD-HH.

Code snippet (note that it deletes the source key after copying):

from boto.s3.connection import S3Connection
import re

# Placeholder credentials; replace with your own access key and secret.
conn = S3Connection('xxx', 'yyy')

# Get buckets:
source_bucket = 'srcBucketName'
destination_bucket = 'dstBucketName'

src = conn.get_bucket(source_bucket)
dst = conn.get_bucket(destination_bucket)

# Iterate over every key in the source bucket.
for key in src.list():
    # Rewrite YYYY/MM/DD/HH to dt=YYYY-MM-DD-HH.
    replaced_key = re.sub(r'(\d{4})/(\d{2})/(\d{2})/(\d{2})', r'dt=\1-\2-\3-\4', key.name)

    # Actual copy, then remove the source key.
    dst.copy_key(replaced_key, src.name, key.name, encrypt_key=True)
    key.delete()
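To check the rename before touching any bucket, the key rewrite can be tried on its own. This is a minimal sketch: the function name to_hive_key and the two_digit_year flag are made up for illustration. The flag is there because the regex above keeps the four-digit year, while the question asks for a YY prefix.

```python
import re

# Matches the Firehose default prefix YYYY/MM/DD/HH anywhere in the key.
FIREHOSE_PREFIX = re.compile(r'(\d{4})/(\d{2})/(\d{2})/(\d{2})')


def to_hive_key(key, two_digit_year=False):
    """Rewrite the first YYYY/MM/DD/HH run in key to dt=YYYY-MM-DD-HH."""
    def repl(match):
        year, month, day, hour = match.groups()
        if two_digit_year:
            year = year[2:]  # keep only the last two digits, e.g. 2017 -> 17
        return 'dt={}-{}-{}-{}'.format(year, month, day, hour)
    # count=1: only the leading date path is rewritten, not later matches.
    return FIREHOSE_PREFIX.sub(repl, key, count=1)


print(to_hive_key('2017/03/09/14/stream-1-2017-03-09-14-05-01-abc'))
# dt=2017-03-09-14/stream-1-2017-03-09-14-05-01-abc
```

Keys that contain no YYYY/MM/DD/HH run come back unchanged, so the function is safe to apply to every key in a listing.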

      
