BigQuery SPLIT() and grouping by result

Question

Using SPLIT() and NTH(), I split a string value and take the second substring as the result. Then I want to group by this result. However, when I use SPLIT() in combination with GROUP BY, it keeps giving an error:

Error: (L1:55): Cannot group by an aggregate

The result is a string, so why can't it be grouped?

For example, this works and returns the correct string:

SELECT NTH(2,SPLIT('FIRST-SECOND','-')) as second_part FROM [FOO.bar] limit 10

But then grouping by the result doesn't work:

SELECT NTH(2,SPLIT('FIRST-SECOND','-')) as second_part FROM [FOO.bar] GROUP BY second_part limit 10

+3

google-bigquery

Graham Polley May 15 '15 at 12:17

4 answers

If there are always exactly two values separated by a delimiter, then a simpler approach is to use REGEXP_EXTRACT:

SELECT REGEXP_EXTRACT('FIRST-SECOND','-(.*)') as second_part 
from [FOO.bar] 
GROUP BY second_part 
limit 10
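
To see what this pattern captures without running a query, here is a minimal sketch in Python. It uses Python's re module as a stand-in for REGEXP_EXTRACT, which is an assumption: BigQuery uses the RE2 engine, although a pattern this simple behaves the same way in both. The helper name regexp_extract and the sample values are hypothetical.

```python
import re
from collections import Counter

def regexp_extract(value, pattern):
    """Hypothetical helper: return the first capture group, or None if no match."""
    match = re.search(pattern, value)
    return match.group(1) if match else None

# Made-up sample values standing in for a column of strings.
rows = ["FIRST-SECOND", "A-SECOND", "B-THIRD", "X-Y-Z"]

# '-(.*)' captures everything after the first dash.
second_parts = [regexp_extract(row, r"-(.*)") for row in rows]

# Counting occurrences plays the role of GROUP BY second_part.
counts = Counter(second_parts)
print(counts)
```

Note that "X-Y-Z" yields "Y-Z" rather than "Y", which is why this approach suits values with exactly two parts.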

+4

David M Smith May 15 '15 at 20:43

I like David's answer, although splitting with RegEx can sometimes get a little tricky. Extracting the first element of a split and then grouping by it is a very common operation. Here is how I usually do it in BigQuery with REGEXP_EXTRACT.

In this simple example, the values in the "splitme" column are delimited by pipes (|).

SELECT REGEXP_EXTRACT(splitme, r'(?U)^(.*)\|') AS title, COUNT(*) as c
FROM [my_table]
GROUP BY title;

This means: capture the text from the beginning of "splitme" up to the first occurrence of a pipe (|). The "(?U)" is the "un-greedy" match flag in the re2 RegEx engine syntax. Without this flag, if there are multiple values separated by pipes, this RegEx will match all the way to the last pipe.
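
The difference between greedy and un-greedy matching can be checked with a small Python sketch. Python's re module does not accept the (?U) flag, so the lazy quantifier `.*?` is used here as an equivalent (an assumption made for illustration); the sample value is made up.

```python
import re

# Made-up sample value with several pipe-delimited fields.
value = "title|author|year"

# Greedy: '.*' grabs as much as possible, so the match runs to the last pipe.
greedy = re.search(r"^(.*)\|", value).group(1)

# Lazy: '.*?' grabs as little as possible, so the match stops at the first pipe.
lazy = re.search(r"^(.*?)\|", value).group(1)

print(greedy)
print(lazy)
```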

+1

Michael Manoochehri Jan 29 '16 at 18:49

In practice, I usually use something like the query below, where N is the number of values in the "list" to skip.

SELECT REGEXP_EXTRACT(string + '|',  r'(?U)^(?:.*\|){N}(.*)\|') AS substring

So, if I were interested in the third value in the list, I would skip two values and use:

SELECT 
  REGEXP_EXTRACT(string + '|',  r'(?U)^(?:.*\|){2}(.*)\|') AS substring,
  COUNT(1) AS weight
FROM yourtable
GROUP BY 1
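
The skip-N pattern above can be tried out with a minimal Python sketch. This is an illustration under assumptions, not the BigQuery implementation: Python's re has no (?U) flag, so lazy quantifiers (`.*?`) stand in for it, and the function name nth_field and the sample rows are hypothetical.

```python
import re
from collections import Counter

def nth_field(value, n_skip, sep="|"):
    """Hypothetical helper: return the field after skipping n_skip fields, or None."""
    s = re.escape(sep)
    # Skip n_skip "field + separator" groups, then capture the next field.
    pattern = r"^(?:.*?%s){%d}(.*?)%s" % (s, n_skip, s)
    # A trailing separator is appended so the last field can also match.
    match = re.search(pattern, value + sep)
    return match.group(1) if match else None

# Made-up rows; skipping 2 values selects the third one.
rows = ["a|b|c", "x|y|c", "p|q|r|s"]
third_values = [nth_field(row, 2) for row in rows]

# Counting plays the role of COUNT(1) ... GROUP BY 1.
weights = Counter(third_values)
print(weights)
```

A value with fewer than N + 1 fields produces no match, so nth_field returns None for it.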
