Saving Pig output to a Hive table in one step
I would like to insert Pig output into Hive tables (the tables are already created in Hive with the exact schema); I just need to insert the output values into them. I don't want the usual method where I first store the output in a file, then read that file from Hive and insert it into the tables. I want to avoid that extra hop.
Is this possible? If so, please tell me how it can be done.
Thanks.
OK. Create an external Hive table with the schema laid out over some HDFS directory, say:
create external table emp_records(id int,
name String,
city String)
row format delimited
fields terminated by '|'
location '/user/cloudera/outputfiles/usecase1';
Just create the table as above; you don't need to load any file into that directory yourself.
Now write a Pig script that reads data from some input directory and stores its output into that location, like below:
A = LOAD 'inputfile.txt' USING PigStorage(',') AS(id:int,name:chararray,city:chararray);
B = FILTER A by id >= 678933;
C = FOREACH B GENERATE id,name,city;
STORE C INTO '/user/cloudera/outputfiles/usecase1' USING PigStorage('|');
Make sure that the STORE location, the field separator, and the schema of the final FOREACH in your Pig script match the Hive DDL schema.
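Once the Pig job finishes, the files it writes into '/user/cloudera/outputfiles/usecase1' are immediately visible through the external table, so you can verify the result straight from Hive. A minimal check (a sketch, assuming the table and location above):
-- run from the Hive shell; reads the files Pig just stored
SELECT * FROM emp_records LIMIT 10;
SELECT COUNT(*) FROM emp_records;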
There are two approaches, described below with an example "Employee" table, for storing Pig output in a Hive table. (The prerequisite is that the Hive table must already be created.)
A = LOAD 'EMPLOYEE.txt' USING PigStorage(',') AS(EMP_NUM:int,EMP_NAME:chararray,EMP_PHONE:int);
Approach 1: Using HCatalog
-- dump Pig result into Hive using HCatalog
store A into 'Empdb.employee' using org.apache.hive.hcatalog.pig.HCatStorer();
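For HCatStorer to work, the HCatalog jars must be on Pig's classpath; the usual way is to launch Pig with the -useHCatalog flag. A minimal invocation (a sketch; the script name is hypothetical):
pig -useHCatalog store_to_hive.pig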
(or)
Approach 2: Using HDFS Physical Location
-- dump Pig result into the external Hive warehouse location
STORE A INTO 'hdfs://<<nmhost>>:<<port>>/user/hive/warehouse/Empdb/employee/' USING PigStorage(',');
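Approach 2 only works if the delimiter and file format Pig writes match what the Hive table expects, since Hive simply reads whatever files land in that warehouse directory. A table definition compatible with the PigStorage(',') output above might look like this (a sketch; column types are assumed from the Pig schema):
create table Empdb.employee(
EMP_NUM int,
EMP_NAME string,
EMP_PHONE int)
row format delimited
fields terminated by ',';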