Automatically split Hive tables based on S3 directory names
I have data stored in S3, for example:
/bucket/date=20140701/file1
/bucket/date=20140701/file2
...
/bucket/date=20140701/fileN
/bucket/date=20140702/file1
/bucket/date=20140702/file2
...
/bucket/date=20140702/fileN
...
My understanding is that if I point Hive at this data, it will automatically interpret date
as a partition column. The table creation looks like this:
CREATE EXTERNAL TABLE search_input(
col1 STRING,
col2 STRING,
...
)
PARTITIONED BY(date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/';
However, Hive does not recognize any data. Any queries I run return 0 results. If I instead take one of the dates via:
CREATE EXTERNAL TABLE search_input_20140701(
col1 STRING,
col2 STRING,
...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/date=20140701';
I can query the data just fine.
Why doesn't Hive recognize the subdirectories as "date = date_str" partitions? Is there a better way to get Hive to run a query across multiple subdirectories and slice it based on the date string?
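I know I could register partitions by hand, one per date; as a sketch against the table above, something like:

```sql
ALTER TABLE search_input ADD IF NOT EXISTS PARTITION (date = '20140701')
LOCATION 's3n://bucket/date=20140701/';
```

but with a new directory arriving every day, that doesn't scale.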
1 answer
To get this to work, I had to do two things:
- Enable recursive directory support:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
- For some reason, it still didn't recognize my partitions, so I had to recover them via:
ALTER TABLE search_input RECOVER PARTITIONS;
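Note that RECOVER PARTITIONS is, as far as I know, an Amazon EMR extension to Hive; on stock Hive the equivalent command should be:

```sql
MSCK REPAIR TABLE search_input;
```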
You can use:
SHOW PARTITIONS table;
to verify that they have been restored.
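Once the partitions show up, filtering on the partition column should prune the scan to only the matching S3 directories, e.g.:

```sql
SELECT COUNT(*)
FROM search_input
WHERE date = '20140701';
```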