Query S3 log contents using Athena or DynamoDB

I have a use case that requires querying request URLs from S3 access logs. Amazon recently introduced Athena to query the contents of files in S3. Which is the better option in terms of cost and performance?

  • Use Athena to query the S3 log files for the request URLs
  • Store each log file's metadata, including the request URL, in a DynamoDB table and query that




3 answers


Amazon DynamoDB is a poor choice for querying web logs.

DynamoDB is super fast, but only if you retrieve data by its primary key (a "Query"). If you search across all the data in a table (for example, looking for a specific IP address in an attribute that is not indexed), DynamoDB has to scan ALL rows in the table (a "Scan"), which takes a long time. For example, if your table is provisioned at 100 reads per second and you scan 10,000 rows, the scan takes 100 seconds (10,000 / 100 = 100).
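The arithmetic above can be sketched as a tiny throughput model (a simplification that assumes each row consumes exactly one read unit):

```python
# Rough model of a DynamoDB Scan under provisioned throughput.
# Simplifying assumption: each row costs exactly one read capacity unit.

def scan_seconds(total_rows: int, reads_per_second: int) -> float:
    """Time in seconds to scan every row at the provisioned read rate."""
    return total_rows / reads_per_second

# The example from the answer: 10,000 rows at 100 reads per second.
print(scan_seconds(10_000, 100))  # 100.0
```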

Tip: Do not perform a full table scan on a NoSQL database.

Amazon Athena is perfect for scanning log files! There is no need to preload data: just run queries against the logs already stored in Amazon S3. Use standard SQL to find the data you are looking for, and you only pay for the data that is read from disk. The log file format is a bit unusual, though, so you need the correct CREATE TABLE statement.



See: Using AWS Athena to query S3 Server Access Logs
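To illustrate why the format needs special handling, here is a minimal Python sketch that parses one hypothetical S3 server access log line with a regular expression in the spirit of the regex-based SerDe an Athena CREATE TABLE for these logs uses (the sample line and field names are illustrative, and the pattern captures only the first few fields):

```python
import re

# Captures the leading fields of an S3 server access log line.
# The real format has more trailing fields (bytes sent, latency, etc.).
LOG_PATTERN = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] '
    r'(?P<remote_ip>\S+) (?P<requester>\S+) (?P<request_id>\S+) '
    r'(?P<operation>\S+) (?P<key>\S+) "(?P<request_uri>[^"]*)" '
    r'(?P<status>\d{3})'
)

# Hypothetical log entry, trimmed for illustration.
sample = ('abc123 my-bucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 '
          'arn:aws:iam::123456789012:user/alice 3E57427F3EXAMPLE '
          'REST.GET.OBJECT photos/cat.jpg '
          '"GET /my-bucket/photos/cat.jpg HTTP/1.1" 200')

m = LOG_PATTERN.match(sample)
print(m.group('remote_ip'))    # 192.0.2.3
print(m.group('request_uri'))  # GET /my-bucket/photos/cat.jpg HTTP/1.1
print(m.group('status'))       # 200
```

Athena does this parsing for you at query time, which is what makes it convenient for ad hoc searches over raw logs.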

Another choice is to use Amazon Redshift, which can hold gigabytes, terabytes, and even petabytes of data across billions of rows. If you are going to run frequent queries against log data, Redshift is great. However, being a standard SQL database, it requires you to preload the data. Unfortunately, Amazon S3 log files are not in CSV format, so you would need to ETL the files into a suitable format. That does not make sense for casual, ad hoc queries.

Many people also like to use Amazon Elasticsearch Service to scan log files. Again, the file format requires special handling and loading the data takes some work, but the result is interactive, real-time analysis of your S3 logs.

See: Using ELK Stack to Parse Your S3 Logs





Athena vs. DynamoDB: if you can functionally meet your requirement with both, then:

  • DynamoDB will be many times faster than Athena.
  • DynamoDB will be more expensive than Athena. In DynamoDB you pay for the provisioned IOPS, while in Athena you pay ONLY when you run a query (otherwise you pay only the S3 storage cost).


Hence, if you rarely need to query your data, Athena is a better solution than DynamoDB. Conversely, if performance is critical, DynamoDB is the answer. Also, if you already have terabytes of data in S3, Athena is the way to go: why bother loading it all into DynamoDB, which would be overkill (unless you need query results in milliseconds rather than seconds)?
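A small cost-model sketch of this trade-off. The dollar rates below are illustrative placeholders, not current AWS pricing (check the AWS pricing pages); the point is the shape of the two cost curves, not the exact numbers:

```python
# Illustrative monthly cost model: pay-per-query (Athena) vs.
# always-on provisioned capacity (DynamoDB). Rates are assumptions.

ATHENA_PER_TB_SCANNED = 5.00      # assumed $/TB scanned per query
DYNAMODB_PER_RCU_HOUR = 0.00013   # assumed $/read-capacity-unit-hour

def athena_monthly_cost(queries_per_month: int, tb_per_query: float) -> float:
    # You pay only when a query runs (S3 storage is billed separately).
    return queries_per_month * tb_per_query * ATHENA_PER_TB_SCANNED

def dynamodb_monthly_cost(provisioned_rcus: int, hours: int = 730) -> float:
    # You pay for provisioned capacity around the clock, queries or not.
    return provisioned_rcus * hours * DYNAMODB_PER_RCU_HOUR

# Ten ad hoc queries a month over 0.1 TB each vs. 500 always-on RCUs:
print(round(athena_monthly_cost(10, 0.1), 2))  # 5.0
print(round(dynamodb_monthly_cost(500), 2))    # 47.45
```

Under these assumed rates, infrequent ad hoc querying strongly favors Athena, while a sustained high query rate can tip the balance the other way.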





As Deepak said, DynamoDB is faster, but it costs more than Athena. Depending on your use case, a hybrid approach might give you good results in certain scenarios.

You can use DynamoDB to store the latest, read-heavy data, while older, rarely read data stays in S3 and is queried with Athena.

However, implementation-wise it will be a bit tricky.









