GROUP BY clause is applied after WHERE clause in Hive?

Question

GROUP BY clause is applied after WHERE clause in Hive?

Suppose I have the following SQL:

select user_group, count(*)
from table
where user_group is not null
group by user_group

Let's further assume that 99% of the data has a zero user_group.

Will this discard strings with a null value prior to GROUP BY, or will one bad reducer end up with 99% of the rows that will later be discarded?

I hope this is the first one. It makes sense.

Bonus points if you say what will happen with the Hive version. We use 0.11 and go to 0.13.

Bonus points if you can provide any supporting documentation.

+3

hive hiveql

dfrankow May 27 '15 at 5:02

source to share

2 answers

This query discards NULL rows before the GROUP BY clause. Hope this link is helpful: -

http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2.2.0/bk_dataintegration/content/hive-013-feature-subqueries-in-where-clauses.html

0

madhu May 27 '15 @ 6:13 am

source to share

Kishore kumar suthar · Accepted Answer · 2015-05-27T06:21:39+0000

Sequence

FROM & JOINs determine & filter rows
WHERE more filters on the rows
GROUP BY combines those rows into groups
HAVING filters groups
SELECT
ORDER BY arranges the remaining rows/groups

The first step is always the FROM clause. In your case, this is pretty straight forward, because there is only one table and there are no complicated joins to worry about. In a query with joins, they are evaluated in this first step. Joins are collected to determine which rows to retrieve, with the ON clause conditions being the criteria for determining which rows should be joined to each table. The result of the FROM clause is an intermediate result. You can think of it as a temporary table made up of combined rows that satisfy all join conditions. (In your case, the temporary table is not actually built because the optimizer knows that it can just access your table directly, without joining any other.)

The next step is the WHERE clause. In a query with a WHERE clause, each row of the intermediate result is evaluated according to the WHERE clauses and is either discarded or retained. This way null will be discarded before going into Group by clause

Next, GROUP BY appears. If there is a GROUP BY clause, the intermediate result is now split into groups, one group for each combination of values in the columns in the GROUP BY clause.

Now the HAVING clause appears. The HAVING clause works once in each group, and all rows from the groups that do not meet the HAVING clause are removed.

Next, SELECT appears. From the rows of the new intermediate result created by the GROUP BY and HAVING clauses, SELECT now gathers the columns it needs.

Finally, the last step is the ORDER BY clause.

GROUP BY clause is applied after WHERE clause in Hive?

More articles: