Automatically split Hive tables based on S3 directory names
I have data stored in S3, for example:
/bucket/date=20140701/file1
/bucket/date=20140701/file2
...
/bucket/date=20140701/fileN
/bucket/date=20140702/file1
/bucket/date=20140702/file2
...
/bucket/date=20140702/fileN
...
My understanding is that if I point Hive at this data, it will automatically interpret date
as a partition column. The table creation looks like this:
CREATE EXTERNAL TABLE search_input(
col1 STRING,
col2 STRING,
...
)
PARTITIONED BY(date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/';
However, Hive does not recognize any data. Any queries I run return 0 results. If I instead take one of the dates via:
CREATE EXTERNAL TABLE search_input_20140701(
col1 STRING,
col2 STRING,
...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/date=20140701';
I can query the data just fine.
Why doesn't Hive recognize the subdirectories as "date = date_str" partitions? Is there a better way to get Hive to run a query across multiple subdirectories and slice it based on the date string?
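I know I could register partitions by hand, one per date; as a sketch against the table above, something like:

```sql
ALTER TABLE search_input ADD IF NOT EXISTS PARTITION (date = '20140701')
LOCATION 's3n://bucket/date=20140701/';
```

but with a new directory arriving every day, that doesn't scale.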
1 answer
To get this to work, I had to do two things:
- Enable recursive directory support:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
- For some reason, it still didn't recognize my partitions, so I had to recover them via:
ALTER TABLE search_input RECOVER PARTITIONS;
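Note that RECOVER PARTITIONS is, as far as I know, an Amazon EMR extension to Hive; on stock Hive the equivalent command should be:

```sql
MSCK REPAIR TABLE search_input;
```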
You can use:
SHOW PARTITIONS table;
to verify that they have been restored.
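Once the partitions show up, filtering on the partition column should prune the scan to only the matching S3 directories, e.g.:

```sql
SELECT COUNT(*)
FROM search_input
WHERE date = '20140701';
```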