How to load a Hive table with a map of structs from another flat/plain Hive table

I have two tables in Hive, ORDER and ORDER_DETAILS (a 1:n relationship, joined on order_id), which I am trying to load into a single table that takes advantage of Hive's complex data types: a map of structs.

Say ORDER has the following data:

Order_id Total_amount Customer

123 10.00 1

456 12.00 2

and ORDER_DETAILS has:

Order_id Order_Item_id Item_amount Item_type

123 1 5.00 A

123 2 5.00 B

456 1 6.00 A

456 2 3.00 B

456 3 3.00 C

I would like to create a single ORDERS table with all the order columns plus the order_detail columns as a map of structs. That lets me keep related data together and query it without frequent joins. I have tried loading a table with complex data types from text/JSON input files with the corresponding SerDe, and that works well. In this scenario, however, I want to load data from two existing Hive ORC tables into a new table. I have tried a basic insert using the named_struct function, but it loads each detail row separately and does not merge rows with the same order_id into one row.

Expected Result:

123 10.00 1 {1: {5.00, A}, 2: {5.00, B}}

456 12.00 2 {1: {6.00, A}, 2: {3.00, B}, 3: {3.00, C}}

But I get:

123 10.00 1 {1: {5.00, A}}

123 10.00 1 {2: {5.00, B}}

456 12.00 2 {1: {6.00, A}}

456 12.00 2 {2: {3.00, B}}

456 12.00 2 {3: {3.00, C}}

Please help me figure out how to achieve this using only an INSERT INTO from the two tables. Thanks in advance.

1 answer


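I found a way to do this using the map type, the named_struct function, and a custom UDF, to_map, posted by David Worms on his blog. to_map is used here as an aggregate: combined with GROUP BY order_id, it collects each order's detail rows into a single map. As a custom function, it has to be registered in the session before the INSERT is run. Here's an example: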

-- ORDER is a reserved word in HiveQL, so the table name is quoted with backticks.
CREATE TABLE `ORDER`(
  order_id bigint,
  total_amount bigint,
  customer bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

CREATE TABLE ORDER_DETAILS(
  order_id bigint,
  Order_Item_id bigint,
  Item_amount bigint,
  Item_type string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

CREATE TABLE ORDERS(
  order_id bigint,
  Order_Items map<bigint, struct<Item_amount: bigint, Item_type: string>>,
  total_amount bigint,
  customer bigint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

INSERT OVERWRITE TABLE ORDERS
SELECT
  a.order_id,
  a.order_items,
  b.total_amount,
  b.customer
FROM
  -- Collapse the detail rows into one map per order_id.
  (SELECT order_id,
          to_map(order_item_id,
                 named_struct("item_amount", item_amount, "item_type", item_type)) AS order_items
   FROM ORDER_DETAILS
   GROUP BY order_id) a
-- ORDER is a reserved word, hence the backticks.
JOIN `ORDER` b ON (a.order_id = b.order_id);
      



select * from ORDERS;



123 {1: {"Item_amount": 5, "Item_type": "A"}, 2: {"Item_amount": 5, "Item_type": "B"}} 10 1

456 {1: {"Item_amount": 6, "Item_type": "A"}, 2: {"Item_amount": 3, "Item_type": "B"}, 3: {"Item_amount": 3, "Item_type": "C"}} 12 2
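
Once ORDERS is loaded, the nested items can be read back without joining to ORDER_DETAILS, which was the point of the exercise. Here is a rough sketch, assuming the ORDERS table defined above and Hive's built-in map indexing and LATERAL VIEW explode. The aliases (d, order_item_id, item, first_item_amount, first_item_type) are only illustrative names, not anything from the original tables:

```sql
-- Look up one item by its map key.
-- 1L is a bigint literal, matching the map's bigint key type.
SELECT order_id,
       order_items[1L].item_amount AS first_item_amount,
       order_items[1L].item_type   AS first_item_type
FROM ORDERS;

-- Flatten the map back to one row per order item.
-- explode() on a map produces two columns: the key and the value.
SELECT o.order_id,
       order_item_id,
       item.item_amount,
       item.item_type
FROM ORDERS o
LATERAL VIEW explode(o.order_items) d AS order_item_id, item;
```

The second query should give back the same shape as the original ORDER_DETAILS rows, which is a handy way to check that the load merged everything correctly.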

Hope this helps everyone.
