MapReduce - How to Plan the Data Files
I would like to use AWS EMR to query large log files that I will write to S3. I can create the files any way I like. Data is generated at 10 kbps.
The logs are made up of dozens of data points and I would like to collect data over a very long period of time (years) to compare trends, etc.
What are the best practices for creating such files to be stored on S3 and requested by the AWS EMR cluster?
What are the optimal file sizes? Should I create separate files, for example, hourly?
What's the best way to name the files?
Should I put them in daily / hourly buckets or all in one bucket?
What is the best way to handle things like adding some data after a while, or changing the data structure that I am using?
Should I compact things, for example by dropping domain names from URLs, or keep as much data as possible?
Is there any concept like partitioning? (The data is based on 100 sites, so I could use site IDs, for example.) I should be able to query all the data together or by partition.
Thanks!
In my opinion, you should store the data in S3 on an hourly basis and then use a data pipeline to schedule the MR job that cleans the data.
After the data is cleaned, you can store it in another location in S3 and then run a second data pipeline hourly, lagging 1 hour behind your MR pipeline, that loads the processed data into Redshift.
Therefore, at 3pm you will have 3 hours of processed data in S3 and 2 hours of processed data in the Redshift DB.
To do this, you can have 1 machine dedicated to executing pipelines, and on that machine you can define a shell / Perl / Python or similar script to load the data into your DB. You can use the AWS Data Pipeline date formatting expressions for year, month, day, hour, etc., e.g.:
#{format(minusHours(@scheduledStartTime,2),'YYYY')}/mm=#{format(minusHours(@scheduledStartTime,2),'MM')}/dd=#{format(minusHours(@scheduledStartTime,2),'dd')}/hh=#{format(minusHours(@scheduledStartTime,2),'HH')}/*
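
If the loading script is written in Python, the same hour-partitioned path could be computed roughly as below. This is only a sketch: the function name `hourly_prefix` and the `lag_hours` parameter are made up for illustration, and the path layout simply mirrors the expression above (scheduled start time minus 2 hours).

```python
from datetime import datetime, timedelta


def hourly_prefix(scheduled_start: datetime, lag_hours: int = 2) -> str:
    """Return the year/mm=/dd=/hh= path for the hour `lag_hours` before the run.

    Hypothetical helper: mirrors the pipeline expression, it is not an AWS API.
    """
    t = scheduled_start - timedelta(hours=lag_hours)
    # strftime-style codes: %Y year, %m month, %d day, %H hour (24h), zero-padded
    return f"{t:%Y}/mm={t:%m}/dd={t:%d}/hh={t:%H}/"


if __name__ == "__main__":
    # A run scheduled at 01:00 on 1 March reads the 23:00 partition of 28 February.
    print(hourly_prefix(datetime(2015, 3, 1, 1, 0)))
```

Subtracting the lag from the full timestamp, rather than from the hour field alone, keeps the day, month and year correct when the lag crosses midnight.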