Problems saving to a Hive table from Pig

I am using HCatalog to read and write data to Hive from a Pig script, like this:

A = LOAD 'customer' USING org.apache.hcatalog.pig.HCatLoader();
B = LOAD 'address' USING org.apache.hcatalog.pig.HCatLoader();
C = JOIN A BY cmr_id, B BY cmr_id;
STORE C INTO 'cmr_address_join' USING org.apache.hcatalog.pig.HCatStorer();

Customer table definition:

cmr_id                  int                     
name                    string                   

      

Address

addr_id                 int                     
cmr_id                  int                     
address                 string                  

      

cmr_address_join

cmr_id                  int                     
name                    string                  
addr_id                 int                     
address                 string    

      

When I run this, Pig throws the following error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1115: Column names should all be in lowercase. Invalid name found: A::cmr_id

      

I believe this happens because Pig is trying to match the Pig-generated field names to the Hive columns, and they don't quite match (A::cmr_id versus cmr_id). I think HCatStorer expects the alias to be cmr_id, not A::cmr_id. I want HCatStorer to ignore the alias prefix and use only the field name.

grunt>  DESCRIBE C;

C: {A::cmr_id: int,A::name: chararray,B::addr_id: int,B::cmr_id: int,B::address: chararray}

      

Is there a way to remove the field prefix (i.e. the A:: part) in Pig? Or if anyone has a workaround or solution, that would be great.

I know we can use the following to explicitly add an alias and make it work.

D = FOREACH C GENERATE A::cmr_id AS cmr_id, A::name AS name, B::addr_id AS addr_id, B::address AS address;
STORE D INTO 'cmr_address_join' USING org.apache.hcatalog.pig.HCatStorer();

But my problem is that I have many tables, each with hundreds of columns. It would be tedious to spell out an alias for every column like this.

Any help to fix this would be greatly appreciated.

2 answers


You can use positional references ($0, $1, etc.) to access the columns, and rename them to the required column names, e.g. $0 AS cmr_id.
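
For the schema reported by DESCRIBE C above, a sketch of that approach might look like this (the positions are an assumption based on that schema, so check them against your own DESCRIBE output):

-- Positions follow the schema of C: $0 = A::cmr_id, $1 = A::name,
-- $2 = B::addr_id, $3 = B::cmr_id, $4 = B::address.
D = FOREACH C GENERATE $0 AS cmr_id, $1 AS name, $2 AS addr_id, $4 AS address;
STORE D INTO 'cmr_address_join' USING org.apache.hcatalog.pig.HCatStorer();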





There is no joy here, but you are unlikely to avoid exactly this solution, especially since a joined relation will contain both join keys (for example, A::cmr_id and B::cmr_id). You have already arrived at the only real answer: project the relation with FOREACH/GENERATE and rename the columns. In practice you will probably have to do this for real Hive tables anyway, since the columns must not only be named correctly but also appear in the correct order. Not to mention that a "real" Hive table is unlikely to store the join key value twice.



The only other solution I can think of (and I do not recommend it) is to STORE C as a file on HDFS and define an unmanaged (most likely EXTERNAL) Hive table that points to the directory where you just saved that file. You could also put a Hive view on top of it that fixes the column order and trims the extra columns (such as the duplicate cmr_id), so that you can then run a new LOAD with HCatLoader and use that alias in the HCatStorer STORE command. It might look cleaner in your Pig script, but you still have to do most of the work (just in Hive), and it will certainly have a performance impact, since you have to write and then re-read the HDFS file represented by C before storing it into the desired Hive table.
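
A very rough sketch of the Pig side of that workaround follows; the HDFS path and the staging table name are made-up placeholders, the EXTERNAL table has to be created separately in Hive over that directory, and a plain FOREACH on the unprefixed names is used here instead of the Hive view mentioned above:

-- Write the joined relation to a plain HDFS directory instead of a Hive table.
-- '/tmp/cmr_address_join_staging' is only an example path.
STORE C INTO '/tmp/cmr_address_join_staging' USING PigStorage('\t');

-- Assuming an EXTERNAL Hive table (here called cmr_address_staging, a
-- hypothetical name) has been defined over that directory with plain column
-- names such as cmr_id, name, addr_id, cmr_id_b, address, it can be reloaded
-- and stored without any A::/B:: prefixes:
E = LOAD 'cmr_address_staging' USING org.apache.hcatalog.pig.HCatLoader();
F = FOREACH E GENERATE cmr_id, name, addr_id, address;
STORE F INTO 'cmr_address_join' USING org.apache.hcatalog.pig.HCatStorer();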
