Hive 0.13 in square brackets

Has anyone in history done map sharing successfully in Hive 13? There is limited documentation on BMJ ( http://www.openkb.info/2014/11/understanding-hive-joins-in-explain.html#.VVuZZpNViko and https://cwiki.apache.org/confluence/download/attachments /27362054/Hive%2BSummit%2B2011-join.pdf ). They seem great, but I cannot get them to work.

If I understood correctly, I should run my command with the following parameters:

SET hive.optimize.bucketmapjoin=true;

      

This will cause Hive to automatically convert your lazy, super slow, plain old gear, joins to a flexible (slender even?), Quickly split, card-merged map. Maybe I am missing some parameters?

My tables are both correctly positioned. They are both housed in 1024 buckets (might that be too much?). The small table has a bucket size of about 15 MB and the large table has an average bucket size of 10 GB or more.

Here my command looks like this:

SET hive.optimize.bucketmapjoin=true;

SELECT
   EV.advertiser_id,
   EV.pixel_id,
   RDB.segment_id
FROM 
   rakuten_segment_index2_bucketed_event_logs_ddp EV
   JOIN rakuten_sir_bucketed_rdb RDB
      ON EV.mm_uuid = RDB.mm_uuid
WHERE 
   EV.days = 90 AND RDB.days = 90;

      

This command results in a slow and steady gearbox connection. It works fine, but very slow.

It should be noted that these tables are separate. I'm only interested in the join of these tables where the partition value (days) = 90. Could the fact that these tables are partitioned mess up the merged map join?

I also tried to set the map connection prompt explicitly with /*+MAPJOIN(EV)*/

. This forces a normal mapjoin, not a bucketed mapjoin. The bush tries to load the entire small (er) table into the hashmap and will run out of memory pretty quickly.

+3


source to share


1 answer


I was able to get this to work eventually. Seems to have something to do with hive.mapjoin.bucket.cache.size

.

Try setting this to a higher number than the default (10 or 100 I think). My work went fine:



SET hive.optimize.bucketmapjoin=true;
set hive.exec.reducers.max = 30000;
set hive.mapjoin.bucket.cache.size=100000000;

SELECT /*+MAPJOIN(EV)*/
   EV.advertiser_id,
   EV.pixel_id,
   RDB.segment_id
FROM 
   tbl_a EV
   JOIN tbl_b RDB
      ON EV.mm_uuid = RDB.mm_uuid
WHERE 
   EV.days = 90 AND RDB.days = 90;

      

Note: A hint is also required mapjoin

.

+1


source







All Articles