How to load a Hive table with a map of structs from flat / plain Hive tables
I have 2 tables in Hive, an Order and an Order_Detail (with a 1:n relationship, linked on order_id), which I am trying to load into a single table, taking advantage of the Hive complex data type map of structs.
Say the ORDER table has the below data,
Order_id Total_amount Customer
123 10.00 1
456 12.00 2
and ORDER_DETAILS has
Order_id Order_Item_id Item_amount Item_type
123 1 5.00 A
123 2 5.00 B
456 1 6.00 A
456 2 3.00 B
456 3 3.00 C
I would like to create a single ORDERS table with all the order columns plus the order_detail columns as a map of structs. That keeps related data together and lets me query it in one place, avoiding frequent joins. I have tried loading tables with complex data types from txt / json input files using the corresponding SerDe, and that works well. But in this scenario I want to load data from the 2 existing Hive ORCFile tables into a new table. I have tried a basic insert using the named_struct function, but it loads each row separately and doesn't merge rows with the same order_id into one row.
Expected Result:
123 10.00 1 {1: {5.00, A}, 2: {5.00, B}}
456 12.00 2 {1: {6.00, A}, 2: {3.00, B}, 3: {3.00, C}}
but I get,
123 10.00 1 {1: {5.00, A}}
123 10.00 1 {2: {5.00, B}}
456 12.00 2 {1: {6.00, A}}
456 12.00 2 {2: {3.00, B}}
456 12.00 2 {3: {3.00, C}}
Please help me figure out how to achieve this using only an INSERT from the 2 tables. Thanks in advance.
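To illustrate the merge I am after: the detail rows need a group-wise fold into one map per order_id. A plain-Python sketch of the desired merging (illustrative only, not Hive):

```python
# Fold ORDER_DETAILS rows into one map of structs per order_id --
# the merge that a plain row-wise named_struct insert does not do.
order_details = [
    (123, 1, 5.00, "A"),
    (123, 2, 5.00, "B"),
    (456, 1, 6.00, "A"),
    (456, 2, 3.00, "B"),
    (456, 3, 3.00, "C"),
]

merged = {}
for order_id, item_id, amount, item_type in order_details:
    # named_struct analogue: one struct per detail row
    struct = {"item_amount": amount, "item_type": item_type}
    # group-wise accumulation: details with the same order_id
    # land in the same map instead of producing separate rows
    merged.setdefault(order_id, {})[item_id] = struct

print(len(merged))     # 2 -- one merged row per order
print(merged[456][3])  # {'item_amount': 3.0, 'item_type': 'C'}
```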
I found a way to do this using the map and named_struct functions together with a custom UDF, to_map, published by David Worms in his blog post on the to_map UDF. Here's an example,
CREATE TABLE ORDER(
order_id bigint,
total_amount bigint,
customer bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
CREATE TABLE ORDER_DETAILS(
order_id bigint,
Order_Item_id bigint,
Item_amount bigint,
Item_type string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
CREATE TABLE ORDERS(
order_id bigint,
Order_Items map<bigint, struct<Item_amount:bigint, Item_type:string>>,
total_amount bigint,
customer bigint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
INSERT OVERWRITE TABLE ORDERS
SELECT
  a.order_id,
  a.order_items,
  b.total_amount,
  b.customer
FROM
  (SELECT order_id,
          to_map(order_item_id, named_struct("item_amount", item_amount, "item_type", item_type)) AS order_items
   FROM ORDER_DETAILS
   GROUP BY order_id) a
JOIN ORDER b ON (a.order_id = b.order_id);
SELECT * FROM ORDERS;
123 {1: {"Item_amount": 5, "Item_type": "A"}, 2: {"Item_amount": 5, "Item_type": "B"}} 10 1
456 {1: {"Item_amount": 6, "Item_type": "A"}, 2: {"Item_amount": 3, "Item_type": "B"}, 3: {"Item_amount": 3, "Item_type": "C"}} 12 2
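With the details embedded, point lookups on the map and struct replace a join at query time, e.g. SELECT order_items[2].item_type FROM ORDERS WHERE order_id = 123 in Hive. A small Python analogue of that access pattern, with one ORDERS row hard-coded for illustration:

```python
# One materialized ORDERS row as produced above, modeled as nested dicts.
orders = {
    123: {
        "order_items": {
            1: {"item_amount": 5, "item_type": "A"},
            2: {"item_amount": 5, "item_type": "B"},
        },
        "total_amount": 10,
        "customer": 1,
    },
}

# Hive: SELECT order_items[2].item_type FROM ORDERS WHERE order_id = 123
# Python analogue -- no join against ORDER_DETAILS needed:
item_type = orders[123]["order_items"][2]["item_type"]
print(item_type)  # B
```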
Hope this helps everyone.