Can you explain when and why MapReduce is invoked in Hive?

  • select * from Table_name limit 5;

  • select col1_name,col2_name from table_name limit 5;

When I run the first query, MapReduce is not invoked, whereas it is invoked for the second one. Could you explain the reason?

+3




3 answers


To understand the reason, we first need to know what the map and reduce phases mean:

  • Map: basically a filter that selects data and arranges it in sorted order. For example, it will pick col1_name and col2_name out of each row in the second query. In the first query, however, you read every column, so no filtering is required. Hence there is no map phase.

  • Reduce: a reducer is just a summary operation over the data across all rows, e.g. the sum of a column. Neither query needs summarized data. Hence there is no reducer.



So the first query involves no MapReduce at all, while the second query has only mappers and no reducers.
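
As a rough illustration of why the second query needs only mappers, column projection can be modeled as a function applied to each row on its own. This is a plain Python sketch, not Hive internals; the sample rows (including `col3_name`) and the helper names `project_map` and `select_columns` are made up for illustration.

```python
from itertools import islice

# Hypothetical sample rows standing in for table_name (not from the original post).
rows = [
    {"col1_name": i, "col2_name": f"value{i}", "col3_name": i * 1.5}
    for i in range(6)
]

def project_map(row, columns):
    """Map step: keep only the requested columns of a single row."""
    return {name: row[name] for name in columns}

def select_columns(rows, columns, limit):
    """Map-only job: project each row independently and stop after `limit` rows."""
    return [project_map(row, columns) for row in islice(rows, limit)]

projected = select_columns(rows, ["col1_name", "col2_name"], limit=5)
```

No row depends on any other row here, so nothing has to be grouped or summarized afterwards, which is why no reducer is involved.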

+2




Take the simple Hive query below:

Describe table;

      

This reads data from the Hive metastore and is the simplest and fastest query in Hive.

select * from table;

      

This query only requires reading data from HDFS. So far, neither query requires a map or reduce phase.

select * from table where color in ('RED','WHITE','BLUE')

      

This query requires only a map phase; there is no reduce phase because there is no aggregation function. Here we filter the table for the RED, WHITE, or BLUE records.
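
A filter like the WHERE clause above can be pictured as each mapper scanning its own input split and emitting only the matching rows. The following is a minimal Python sketch under that assumption; the splits, the `id` column, and the name `filter_map` are hypothetical.

```python
WANTED = {"RED", "WHITE", "BLUE"}

def filter_map(split):
    """Map step: emit only the rows of one input split whose color is wanted."""
    return [row for row in split if row["color"] in WANTED]

# Two hypothetical input splits (for example, two HDFS blocks of the same table).
splits = [
    [{"id": 1, "color": "RED"}, {"id": 2, "color": "GREEN"}],
    [{"id": 3, "color": "BLUE"}, {"id": 4, "color": "BLACK"}, {"id": 5, "color": "WHITE"}],
]

# Each split is filtered on its own; the outputs are simply concatenated,
# with no grouping step in between.
result = [row for split in splits for row in filter_map(split)]
```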

select count(1) from table;

      

This query essentially requires only a reduce phase. No mapping to a key is required because we are counting all the records in the table. If we want to count per item, we add a map phase before the reduce phase. See below:

Select color
, count(1) as color_count 
  from table  
  group by color;

      



This query has an aggregation function and a GROUP BY clause. We count the number of RED, WHITE, and BLUE records in the table. This count requires both a map and a reduce job.

We are essentially creating key-value pairs in the above job. We map each record to a key, in this case RED, WHITE, or BLUE, and emit a value of one for it. So the key: value pair is color: 1. Then we can sum the values for each color key. This is a map and reduce job.
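
The key-value flow described above can be sketched in a few lines of Python. This only imitates the idea (map, group by key, reduce); it is not how Hive is implemented, and the sample `records` list and the function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical values of the color column.
records = ["RED", "WHITE", "RED", "BLUE", "WHITE", "RED"]

def map_phase(colors):
    """Emit a (color, 1) pair for every record."""
    return [(color, 1) for color in colors]

def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

color_count = reduce_phase(shuffle(map_phase(records)))
```

With these sample records, `color_count` maps RED to 3, WHITE to 2, and BLUE to 1.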

Now take that same query and add an ORDER BY clause.

Select color
, count(1) as color_count 
  from table  
  group by color
  order by color_count desc;

      

This adds another reduce phase and forces a single reducer for the dataset being passed through. This is necessary because we want to ensure that a global order is maintained. A count(distinct color) also forces a single reducer and requires both a map and a reduce phase.

As you add complexity to your Hive query, you similarly add to the map and reduce tasks required to produce the requested results.

If you want to know how Hive will execute the query, you can put the EXPLAIN clause before your query.
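
To see why a global order needs one reducer, compare sorting inside separate partitions with sorting everything in one place. This is a small Python sketch with made-up counts; the way the pairs are split into two partitions is an arbitrary assumption.

```python
# Hypothetical output of the GROUP BY stage: (color, color_count) pairs.
counts = [("RED", 3), ("WHITE", 2), ("BLUE", 5), ("GREEN", 1)]

def sort_desc(pairs):
    """Sort (color, count) pairs by count, largest first."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Two reducers: each one sorts only the partition it received.
partitions = [counts[:2], counts[2:]]
two_reducers = [pair for part in partitions for pair in sort_desc(part)]

# One reducer: it sees every pair, so its output is globally ordered.
one_reducer = sort_desc(counts)
```

Here `two_reducers` comes out with counts 3, 2, 5, 1 (sorted within each partition but not overall), while `one_reducer` yields 5, 3, 2, 1.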

 Explain select * from table;

      

This can give you an idea of how the query is executed under the hood. It will show you the stage dependencies, which aggregations (if any) lead to reduce jobs, and which operators lead to map jobs.

+4




It's logical.

In the first query there is only one thing to do: output data with a limit of 5 (which means simply dumping any 5 rows). No processing specific to the query is needed (other than knowing how the rows are delimited).

But the second query requires a MapReduce job. Why? Because it first has to process the data to work out the columns, to know whether col1 and col2 actually exist or whether there is only one column. If they exist, it should first eliminate the other columns and then take only five rows of the remaining columns.

+1

