AWS Athena: bad field name and multiple folders with Hive DDL

I am new to AWS Athena and I am trying to query multiple S3 folders containing JSON files. I ran into a number of problems that have no answer in the documentation (unfortunately, the error log is not informative enough for me to solve them on my own):

  1. How do I query a JSON field whose name contains parentheses? For example, I have a "Capacity(GB)" field, and when I try to include it in a CREATE EXTERNAL TABLE statement I get an error:
   CREATE EXTERNAL TABLE IF NOT EXISTS test-scema.test_table (
  `device`: string,
  `Capacity(GB)`: string)


Your query has the following error(s):

FAILED: Runtime error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: Error :: expected at position 'Capacity (GB): string>' but '(' found.

  2. My files are stored in S3 in subfolders with the following structure:

    'location_name/YYYY/MM/DD/appstring/'

and I want to query all dates for a specific appstring (one of many). Is there any wildcard I can use to replace the date part of the path? Something like this:

LOCATION 's3://location_name/%/%/%/appstring/'

  3. Do I need to load the raw data as-is with CREATE EXTERNAL TABLE and only query it afterwards, or can I add a WHERE clause to the table creation itself? Specifically, is something like the following possible:
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
  field1 string,
  field2 string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
LOCATION 's3://folder/YYYY/MM/DD/appstring'
WHERE field2 = 'value'

What would that mean in terms of billing? Right now I am running this CREATE statement only to use the data in a single SQL query.

Thanks!



1 answer


1. JSON field with a name containing parentheses

There is no need to create a field named Capacity(GB). Instead, create a field with a different name:

CREATE EXTERNAL TABLE test_table (
    device string,
    capacity string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ('paths' = 'device,Capacity(GB)')
LOCATION 's3://xxx';
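
For example, assuming the paths mapping above behaves as described, a record such as this hypothetical one

{ "device": "disk1", "Capacity(GB)": "500" }

would then be queried through the renamed column:

-- The filter uses the remapped column name, not "Capacity(GB)"
SELECT device, capacity
FROM test_table
WHERE capacity = '500';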


If you are using nested JSON, you can use the SerDe mapping property (which I have seen with the OpenX Hive SerDe, which handles nested structures):

CREATE EXTERNAL TABLE test_table (
   top string,
   inner struct<device:INT,
                capacity:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  "mapping.capacity" = "Capacity(GB)"
)
LOCATION 's3://xxx';


This works well with input:

{ "top" : "123", "inner": { "Capacity(GB)": 12, "device":2}}




2. Subfolders

You cannot use wildcards for the middle of the path (s3://location_name/*/*/*/appstring/). The closest option is to use partitioned data, but your directories would need a different naming format.
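
As a rough sketch of what that could look like: if the objects were rearranged into Hive-style key=value paths such as s3://location_name/appstring/year=2017/month=06/day=01/ (a hypothetical layout, not your current one), the table could be partitioned and the partitions registered automatically:

CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
  device string,
  capacity string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://location_name/appstring/';

-- Scans the location and registers every year=/month=/day= partition it finds
MSCK REPAIR TABLE test_schema.test_table;

-- Queries can then prune partitions instead of scanning every date
SELECT device, capacity
FROM test_schema.test_table
WHERE year = '2017' AND month = '06';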

3. Creating tables

You cannot specify a WHERE clause as part of a CREATE TABLE statement.
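
The filter goes into the queries you run against the table afterwards. Assuming the table from your third snippet has been created (same hypothetical field names), it would look like this; note that Athena bills for the data scanned by the SELECT, while DDL statements such as CREATE EXTERNAL TABLE are not charged:

-- The WHERE clause lives in the query, not in the table definition
SELECT field1, field2
FROM test_schema.test_table
WHERE field2 = 'value';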

If your goal is to lower your query costs, use partitioned data to reduce the number of files scanned, or store the data in a columnar format such as Parquet.

Examples: Analyzing Data in S3 Using Amazon Athena
