Can I use the Python mrjob library on partitioned Hive tables?

I have user access to a Hadoop server/cluster containing data that is stored exclusively in partitioned Hive tables/files (Avro format). I was wondering whether I can run MapReduce with Python mrjob against these tables. So far I've tested mrjob locally on text files stored on CDH5, and I'm impressed with the ease of development.

After some research I found that there is a library called HCatalog, but as far as I know it is not available for Python (Java only). Unfortunately, I don't have much time to learn Java and would like to stick with Python.

Do you know of any way to run mrjob on data stored in Hive?

If this is not possible, is there a way to pass Python-written MapReduce code to Hive? (I would rather not upload Python MapReduce files into Hive.)



1 answer


As Alex said, mrjob currently does not work with files stored in Avro. However, there is a way to execute Python code directly on Hive tables (no need for mrjob, although with a loss of flexibility). In the end I was able to add the Python file as a Hive resource with "ADD FILE mapper.py" and execute a SELECT clause with TRANSFORM ... USING ..., storing the mapping results in a separate table. An example Hive query:

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;
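For reference, here is a minimal sketch of what a weekday_mapper.py for this query could look like. The script name and column layout come from the query above; the implementation itself is an assumption (this is the classic MovieLens example shape, where the Unix timestamp is replaced by a weekday number). Hive streams each row to the script as a tab-separated line on stdin and reads tab-separated output from stdout:

```python
# Hypothetical weekday_mapper.py -- a sketch, not the actual script
# referenced above. Hive's TRANSFORM feeds rows as tab-separated
# lines on stdin and parses tab-separated lines from stdout.
import sys
from datetime import datetime, timezone

def transform_line(line):
    # Input columns: userid, movieid, rating, unixtime
    userid, movieid, rating, unixtime = line.strip().split('\t')
    # Replace the timestamp with its ISO weekday (1 = Monday .. 7 = Sunday);
    # UTC is assumed here to keep the result deterministic.
    weekday = datetime.fromtimestamp(float(unixtime), tz=timezone.utc).isoweekday()
    return '\t'.join([userid, movieid, rating, str(weekday)])

if __name__ == '__main__':
    for line in sys.stdin:
        print(transform_line(line))
```

The AS (userid, movieid, rating, weekday) clause in the query then maps the script's four output columns onto the columns of u_data_new.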



A complete example is available here (below): link
