Automatically partition Hive tables based on S3 directory names

I have data stored in S3, for example:

/bucket/date=20140701/file1
/bucket/date=20140701/file2
...
/bucket/date=20140701/fileN

/bucket/date=20140702/file1
/bucket/date=20140702/file2
...
/bucket/date=20140702/fileN
...

My understanding is that if I pull this data into Hive, it will automatically interpret date as a partition. The table creation looks like this:

CREATE EXTERNAL TABLE search_input(
   col1 STRING,
   col2 STRING,
   ...
)
PARTITIONED BY(date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/';

However, Hive does not recognize any data; any query I run returns 0 results. If I instead create a table for just one of the dates:

CREATE EXTERNAL TABLE search_input_20140701(
   col1 STRING,
   col2 STRING,
   ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/date=20140701';

I can query the data just fine.

Why doesn't Hive recognize the subdirectories that follow the date=date_str naming scheme as partitions? Is there a better way to get Hive to query across multiple subdirectories and slice by the date string?


1 answer


To get this to work, I had to do 2 things:

  • Enable recursive directory support:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

  • For some reason it still didn't recognize my partitions, so I had to recover them via:


ALTER TABLE search_input RECOVER PARTITIONS;

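Note that ALTER TABLE ... RECOVER PARTITIONS is an Amazon EMR extension to Hive. On stock Hive the same recovery can be done with the statements below; a sketch, assuming the search_input table defined above:

-- Standard Hive: scan the table's location and add any
-- date=... partition directories missing from the metastore
MSCK REPAIR TABLE search_input;

-- Or register a single partition by hand
ALTER TABLE search_input ADD PARTITION (date='20140701')
LOCATION 's3n://bucket/date=20140701/';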

You can use:

SHOW PARTITIONS search_input;

to verify that they have been recovered.

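Once the partitions show up, queries can filter on the partition column and Hive will only read the matching S3 directories; a small example against the table assumed above:

-- Only the date=20140701 directory is scanned
SELECT col1, col2
FROM search_input
WHERE date = '20140701';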