BigQuery SPLIT() and grouping by result
Using SPLIT() and NTH(), I split the string value and take the 2nd substring as the result. Then I want to group by this result. However, when I use SPLIT() in combination with GROUP BY, it keeps giving an error:
Error: (L1:55): Cannot group by an aggregate
The result is a string, so why can't it be grouped?
For example, this works and returns the correct string:
SELECT NTH(2,SPLIT('FIRST-SECOND','-')) as second_part FROM [FOO.bar] limit 10
But then grouping by the result doesn't work:
SELECT NTH(2,SPLIT('FIRST-SECOND','-')) as second_part FROM [FOO.bar] GROUP BY second_part limit 10
I like David's answer, but splitting can sometimes get a little tricky, and a regex is often simpler. Extracting the first item from a delimited string and then grouping by it is a very common operation. What I usually do in BigQuery is use REGEXP_EXTRACT like this:
In this simple example, the "splitme" column holds values delimited by pipes (|).
SELECT REGEXP_EXTRACT(splitme, r'(?U)^(.*)\|') AS title, COUNT(*) as c
FROM [my_table]
GROUP BY title;
This cuts the string from the beginning of "splitme" up to the first occurrence of the pipe (|). "(?U)" is the "ungreedy" match flag in the re2 regex engine syntax. Without this flag, if there are multiple values separated by pipes, the regex will match all the way to the last pipe.
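To see what the ungreedy flag changes, here is a small Python sketch of the same idea. Python's re module does not accept re2's (?U) flag, so the lazy quantifier .*? stands in for it here. The sample value and the function names are made up for illustration and are not part of the query above.

```python
import re

def first_field_greedy(value):
    # Greedy .* runs as far as it can, so the capture ends at the LAST pipe.
    match = re.search(r'^(.*)\|', value)
    return match.group(1) if match else None

def first_field_lazy(value):
    # Lazy .*? stops at the FIRST pipe; this plays the role of re2's (?U)^(.*)\|
    match = re.search(r'^(.*?)\|', value)
    return match.group(1) if match else None

sample = 'alpha|beta|gamma'  # hypothetical "splitme" value
print(first_field_greedy(sample))  # alpha|beta
print(first_field_lazy(sample))    # alpha
```

A value with no pipe at all gives no match in either version, which roughly corresponds to REGEXP_EXTRACT returning NULL for that row.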
In practice, I usually use something like the query below, where N is the number of values in the "list" to skip.
SELECT REGEXP_EXTRACT(string + '|', r'(?U)^(?:.*\|){N}(.*)\|') AS substring
So, if I was interested in the third value in the list, I would use:
SELECT
REGEXP_EXTRACT(string + '|', r'(?U)^(?:.*\|){2}(.*)\|') AS substring,
COUNT(1) AS weight
FROM yourtable
GROUP BY 1
For more, see the re2 syntax documentation.
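If you want to check the skip-N pattern outside BigQuery, a rough Python equivalent might look like the sketch below. It assumes the lazy quantifier .*? as a substitute for (?U), which Python's re module does not support, and uses collections.Counter in place of GROUP BY with COUNT. The rows list and the nth_field helper are hypothetical.

```python
import re
from collections import Counter

def nth_field(value, skip):
    # Append a trailing pipe so the last field is also followed by a
    # delimiter, like string + '|' in the query.
    pattern = r'^(?:.*?\|){%d}(.*?)\|' % skip
    match = re.search(pattern, value + '|')
    return match.group(1) if match else None

# Hypothetical pipe-delimited values standing in for the table column.
rows = ['a|b|c', 'x|y|c', 'p|q|r|s', 'only|two']

# Skip 2 values to get the third one, then count rows per extracted value.
weights = Counter(nth_field(row, 2) for row in rows)
print(weights)
```

Rows with fewer than three values yield None here, so they end up counted together under one key.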