Invalid field name in AWS Athena and multiple folders with Hive DDL
I am new to AWS Athena and I am trying to query multiple S3 buckets containing JSON files. I ran into a number of problems that have no answer in the documentation (unfortunately, the error log is not informative enough for me to solve them myself):
- How do I query a JSON field whose name contains parentheses? For example, I have a "Capacity(GB)" field, and when I try to include it in a CREATE EXTERNAL TABLE statement, I get an error:
CREATE EXTERNAL TABLE IF NOT EXISTS test-scema.test_table ( `device`: string, `Capacity(GB)`: string)
Your query has the following error(s):
FAILED: Runtime error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: Error :: expected at position 'Capacity (GB): string>' but '(' found.
- My files are located in subfolders in S3 with the following structure:
'location_name/YYYY/MM/DD/appstring/'
and I want to query all the dates of a specific appstring (out of many). Is there any "wildcard" I can use to replace the date part of the path? Something like this:
LOCATION 's3://location_name/%/%/%/appstring/'
- Do I need to load the raw data as is using CREATE EXTERNAL TABLE and only then query it, or can I add WHERE clauses to the CREATE statement itself? Specifically, is something like the following possible?
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
field1:string,
field2:string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://folder/YYYY/MM/DD/appstring'
WHERE field2='value'
What would the implications be in terms of billing? I ask because right now I am writing this CREATE statement only to use the data once, in a single SQL query.
Thanks!
1. JSON field with parentheses in its name
There is no need to create a field named Capacity(GB). Instead, create a field with a different name:
CREATE EXTERNAL TABLE test_table (
device string,
capacity string
)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='device,Capacity(GB)')
LOCATION 's3://xxx';
If you are using nested JSON, you can use the SerDe's mapping property (which I have seen used with a Hive JSON SerDe on nested structs):
CREATE external TABLE test_table (
top string,
inner struct<device:INT,
capacity:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.capacity" = "Capacity(GB)"
)
LOCATION 's3://xxx';
This works well with input:
{ "top" : "123", "inner": { "Capacity(GB)": 12, "device":2}}
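To show how the mapped column is then read, here is a minimal query sketch. It assumes the test_table definition above was created as shown. The double quotes around "inner" are there because INNER is a reserved word in Athena's query syntax.

```sql
-- Sketch: read the nested fields of test_table (defined above).
-- "inner" is double-quoted because INNER is a reserved word in queries.
-- The JSON key "Capacity(GB)" is exposed through the mapped name capacity.
SELECT top,
       "inner".device   AS device,
       "inner".capacity AS capacity_gb
FROM test_table;
```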
2. Subfolders
You cannot use wildcards in the middle of a path (s3://location_name/*/*/*/appstring/). The closest option is to use partitioned data, but that would require a different naming format for your directories.
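As a rough sketch of the partitioned approach, assume the directories were renamed to the Hive-style key=value layout. The layout and the partition column names year, month, day and app are hypothetical. Athena runs one statement per query, so each statement below would be submitted separately.

```sql
-- Assumed (hypothetical) layout:
--   s3://location_name/year=2017/month=01/day=15/app=appstring/
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
  field1 string,
  field2 string
)
PARTITIONED BY (year string, month string, day string, app string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://location_name/';

-- Register every partition that follows the key=value naming.
MSCK REPAIR TABLE test_schema.test_table;

-- All dates for one application: only the matching app=... folders are scanned.
SELECT field1, field2
FROM test_schema.test_table
WHERE app = 'appstring';
```

If renaming the directories is not practical, partitions can also be registered one at a time with ALTER TABLE ... ADD PARTITION (...) LOCATION '...', pointing each partition at an existing folder. That means one statement per date folder.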
3. Creating tables
You cannot specify WHERE clauses as part of a CREATE TABLE statement.
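The usual pattern is to define the table once and put the filter in the query instead. Below is a minimal sketch reusing the table and field names from the question. As far as I know, Athena does not charge for DDL statements such as CREATE EXTERNAL TABLE. It charges for the data scanned by queries, so defining a table for a single query should add no cost by itself.

```sql
-- Step 1: define the table over the raw files (metadata only, no data is read).
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
  field1 string,
  field2 string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://folder/YYYY/MM/DD/appstring/';

-- Step 2: filter at query time (this is the statement that scans data).
SELECT field1, field2
FROM test_schema.test_table
WHERE field2 = 'value';
```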
If your goal is to lower your data costs, use partitioned data to reduce the number of files scanned, or store the data in a columnar format such as Parquet.
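One way to produce a Parquet copy is a CREATE TABLE AS SELECT (CTAS) query, sketched below. CTAS is a query in its own right, not the CREATE EXTERNAL TABLE DDL: the WHERE clause belongs to the SELECT, and the statement is billed for the data it scans. The target table name and the external_location path are hypothetical, and the target location must be empty.

```sql
-- Sketch: write a filtered Parquet copy of the JSON-backed table.
-- test_table_parquet and the S3 path below are hypothetical names.
CREATE TABLE test_schema.test_table_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://folder/parquet/test_table/'
) AS
SELECT field1, field2
FROM test_schema.test_table
WHERE field2 = 'value';
```

Later queries against test_table_parquet read only the columns they reference, which is where the savings from a columnar format come from.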
Examples: Analyzing Data in S3 Using Amazon Athena