BigQuery SPLIT() and grouping by result

Using SPLIT() and NTH(), I split a string value and take the second substring as the result. I then want to group by this result. However, when I use SPLIT() in combination with GROUP BY, it keeps giving an error:

Error: (L1:55): Cannot group by an aggregate

      

The result is a string, so why can't it be grouped?

For example, this works and returns the correct string:

SELECT NTH(2,SPLIT('FIRST-SECOND','-')) as second_part FROM [FOO.bar] limit 10

      


But grouping by the result doesn't work:

SELECT NTH(2,SPLIT('FIRST-SECOND','-')) as second_part FROM [FOO.bar] GROUP BY second_part limit 10

      

4 answers


My best guess is that you can get an equivalent result using a subquery. Something like:

SELECT second_part FROM (SELECT NTH(2, SPLIT('FIRST-SECOND', '-')) AS second_part FROM [FOO.bar]) GROUP BY second_part LIMIT 10

      



I suppose the system treats NTH() as an aggregate function, which is why its result cannot be grouped at the same query level.



If there are always only two values separated by a delimiter, then a simpler approach is to use REGEXP_EXTRACT:



SELECT REGEXP_EXTRACT('FIRST-SECOND','-(.*)') as second_part 
from [FOO.bar] 
GROUP BY second_part 
limit 10
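
To see why this shortcut depends on there being exactly two parts, the pattern can be tried outside BigQuery. The following is a minimal Python sketch, not BigQuery code. It assumes Python's `re` module behaves like re2 for this simple pattern, and the helper name `second_part` is made up for illustration.

```python
import re

def second_part(value):
    """Return everything after the first '-', or None if there is no '-'."""
    # Same pattern as in the query above: a dash, then capture the rest.
    match = re.search(r'-(.*)', value)
    return match.group(1) if match else None

print(second_part('FIRST-SECOND'))        # two parts: the second one is returned
print(second_part('FIRST-SECOND-THIRD'))  # three parts: the capture keeps the later dash
print(second_part('NODASH'))              # no delimiter: nothing to extract
```

With more than two parts the capture runs to the end of the string, so the result is no longer just the second substring.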

      



I like David's answer, although splitting with a regex can sometimes get a little tricky. Extracting the first item of a delimited string and grouping by it is a very common operation. The way I usually do this in BigQuery is with REGEXP_EXTRACT.

In this simple example, the "splitme" column is delimited by pipes (|):

SELECT REGEXP_EXTRACT(splitme, r'(?U)^(.*)\|') AS title, COUNT(*) as c
FROM [my_table]
GROUP BY title;

      

This extracts the text from the beginning of "splitme" up to the first pipe (|). The "(?U)" is the "un-greedy" match flag in the re2 regex engine syntax. Without this flag, if there are multiple values separated by pipes, the regex would match all the way to the last pipe.
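
The difference between the greedy and un-greedy match can be checked with a small Python sketch. This is only an approximation of the BigQuery behavior: Python's `re` module has no `(?U)` flag, so the sketch uses the lazy quantifier `.*?` as a stand-in, and the sample string is invented.

```python
import re

sample = 'alpha|beta|gamma'  # made-up value with several pipe-delimited items

# Greedy: .* grabs as much as possible, so the match ends at the LAST pipe.
greedy = re.search(r'^(.*)\|', sample).group(1)

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST pipe.
# This stands in for the (?U) flag used in the re2 pattern.
lazy = re.search(r'^(.*?)\|', sample).group(1)

print(greedy)
print(lazy)
```

Only the lazy form isolates the first item, which is the value the query groups by.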



In practice, I usually use something like the query below, where N is the number of values in the "list" to skip.

SELECT REGEXP_EXTRACT(string + '|',  r'(?U)^(?:.*\|){N}(.*)\|') AS substring 

      

So, if I was interested in the third value in the list, I would use:

SELECT 
  REGEXP_EXTRACT(string + '|',  r'(?U)^(?:.*\|){2}(.*)\|') AS substring,
  COUNT(1) AS weight
FROM yourtable
GROUP BY 1
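
The same skip-N idea can be sketched in Python to check the pattern on sample data. This is an illustration under assumptions, not BigQuery code: the lazy quantifier `.*?` replaces the `(?U)` flag, which Python's `re` does not support, and the names `nth_field`, `rows` and `weights` are hypothetical.

```python
import re
from collections import Counter

def nth_field(value, skip):
    """Return the pipe-delimited item after skipping `skip` items, or None."""
    # A trailing pipe is appended so the last item is also terminated by '|'.
    pattern = r'^(?:.*?\|){%d}(.*?)\|' % skip
    match = re.search(pattern, value + '|')
    return match.group(1) if match else None

# Made-up rows; the last one has too few items to yield a third value.
rows = ['a|b|c', 'x|y|c', 'p|q|r', 'short']

# Rough equivalent of GROUP BY on the extracted value with COUNT(1).
weights = Counter(nth_field(row, 2) for row in rows)
print(weights)
```

Rows with fewer items than requested produce no match, which in this sketch shows up as a `None` group.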

      

See the re2 syntax documentation for more details.
