Grouping a query into a group and a subgroup

Question

Grouping a query into a group and a subgroup

I want to group my data using SQL or R to get the top or bottom 10 Subarea_codes

for each Company

and Area_code

. Essentially: Subarea_codes

inside Area_codes

, where everyone Company

has the largest or smallest score.

data.csv

Area_code  Subarea_code  Company   Result
10         101           A         15
10         101           P         10
10         101           C         4
10         102           A         10
10         102           P         8
10         102           C         5
11         111           A         15
11         111           P         20
11         111           C         5
11         112           A         10
11         112           P         5
11         112           C         10


result.csv should be like this

Company   Area_code  Largest_subarea_code  Result  Smallest_subarea_code    Result
A         10         101                   15      102                      10
P         10         101                   10      102                      8            
C         10         102                   5       101                      4
A         11         111                   15      112                      10
P         11         111                   20      112                      5
C         11         112                   10      111                      5

There Area_code

could be hundreds inside each Subarea_codes

, but I only want top and bottom 10 for each company.

Also, it should not be allowed in one request, but it can be split into two requests, i.e. the smallest is represented in results_10_smallest and the largest in result_10_largest. But I hope I can accomplish this with one query for each result.

What I have tried:

SELECT Company, Area_code, Subarea_code MAX(Result) 
    AS Max_result
FROM data
GROUP BY Subarea_code
ORDER BY Company
;

This gives me everything Companies

with the highest results in every Subarea_code. This would mean: A, A, P, AC for the data above.

+3

sql sqlite r greatest-n-per-group

user3471881 08 Sep 14 at 9:08

source to share

5 answers

If you're already doing this in R, why not use a much more efficient one data.table

instead of sqldf

using SQL syntax? Assuming that data

is your dataset, simply:

library(data.table)
setDT(data)[, list(Largest_subarea_code = Subarea_code[which.max(Result)],
            Resultmax = max(Result),
            Smallest_subarea_code = Subarea_code[which.min(Result)],
            Resultmin = min(Result)), by = list(Company, Area_code)]
#    Company Area_code Largest_subarea_code Resultmax Smallest_subarea_code Resultmin
# 1:       A        10                  101        15                   102        10
# 2:       P        10                  101        10                   102         8
# 3:       C        10                  102         5                   101         4
# 4:       A        11                  111        15                   112        10
# 5:       P        11                  111        20                   112         5
# 6:       C        11                  112        10                   111         5

+5

David Arenburg 08 Sep 14 at 11:24

source to share

There seems to be a discrepancy between the result shown and the description. The description asks for 10th and bottom 10 results for each Area code / company, but the sample output only shows the top 1 and bottom 1. For example, for Region Code 10 and Subarea 101 of Company A, the top is 15 and Subarea 102 is the second a value with a result of 10, so as described there should be two rows for this company / area code combination. (If there were more data, there would be up to 10 rows for this combination of company and region combinations.)

We give two answers. The first assumes that the top 10 and bottom 10 are required for each company and area code, as in the question description, and the second assumes only the top and bottom for each company and area code, as in the output of the question.

1) Top / Bottom 10

Here we assume that the 10th and bottom 10 results for each company / area code are required. If it's just top and bottom then see (2) later (or replace 10 with 1 in the code here). Bottom10

are all rows for which there are 10 or fewer subareas for the same area and company code with equal or less results. Top10

is similar.

library(sqldf)

Bottom10 <- sqldf("select a.Company, 
                          a.Area_code, 
                          a.Subarea_code Bottom_Subarea, 
                          a.Result Bottom_Result,
                          count(*) Bottom_Rank
        from df a join df b  
        on a.Company = b.Company and 
           a.Area_code = B.Area_code and
           b.Result <= a.Result
        group by a.Company, a.Area_code, a.Subarea_code
        having count(*) <= 10")

Top10 <- sqldf("select a.Company, 
                       a.Area_code, 
                       a.Subarea_code Top_Subarea, 
                       a.Result Top_Result,
                       count(*) Top_Rank
        from df a join df b  
        on a.Company = b.Company and 
           a.Area_code = B.Area_code and 
           b.Result >= a.Result
        group by a.Company, a.Area_code, a.Subarea_code
        having count(*) <= 10")

The description states that you want to get the top 10 or bottom 10 for each company / region code, in which case just use one of the above results. If you would like to combine them, we show the merge below. We added a Rank column to indicate the lowest / highest (Rank is 1), the second lowest / highest (Rank is 2), etc.

sqldf("select t.Area_code, 
              t.Company, 
              t.Top_Rank Rank,
              t.Top_Subarea, 
              t.Top_Result,
              b.Bottom_Subarea,
              b.Bottom_Result
       from Bottom10 b join Top10 t
       on t.Area_code = b.Area_code and 
          t.Company = b.Company and
          t.Top_Rank = b.Bottom_Rank
       order by t.Area_code, t.Company, t.Top_Rank")

giving:

   Area_code Company Rank Top_Subarea Top_Result Bottom_Subarea Bottom_Result
1         10       A    1         101         15            102            10
2         10       A    2         102         10            101            15
3         10       C    1         102          5            101             4
4         10       C    2         101          4            102             5
5         10       P    1         101         10            102             8
6         10       P    2         102          8            101            10
7         11       A    1         111         15            112            10
8         11       A    2         112         10            111            15
9         11       C    1         112         10            111             5
10        11       C    2         111          5            112            10
11        11       P    1         111         20            112             5
12        11       P    2         112          5            111            20

Please note that this format has less sense if there are connections and can actually generate more than 10 lines of code Company / Area, so you may just want to use the individual Top10

, and Bottom10

in this case, you can also consider the jitter df$Result

, if this is the problem:

df$Result <- jitter(df$Result)
# now perform SQL statements

2) Only top / bottom

Here we only list the top and bottom results and the corresponding subareas for each company / area code. Note that this uses an extension for SQL supported by sqlite and the SQL is significantly simpler:

Bottom1 <- sqldf("select Company, 
                          Area_code, 
                          Subarea_code Bottom_Subarea, 
                          min(Result) Bottom_Result
        from df
        group by Company, Area_code")

Top1 <- sqldf("select Company, 
                      Area_code, 
                      Subarea_code Top_Subarea, 
                      max(Result) Top_Result
        from df
        group by Company, Area_code")

sqldf("select a.Company, 
              a.Area_code, 
              Top_Subarea, 
              Top_Result,
              Bottom_Subarea
              Bottom_Result
        from Top1 a join Bottom1 b  
        on a.Company = b.Company and 
           a.Area_code = b.Area_code
        order by a.Area_code, a.Company")

This gives:

  Company Area_code Top_Subarea Top_Result Bottom_Result
1       A        10         101         15           102
2       C        10         102          5           101
3       P        10         101         10           102
4       A        11         111         15           112
5       C        11         112         10           111
6       P        11         111         20           112

Update Hotfix and added (2).

+1

G. Grothendieck 08 Sep 14 at 14:11

source to share

The above answers are enough to get the maximum result.

This solves the top10 problem:

data.top <- data[ave(-data$Result, data$Company, data$Area_code, FUN = rank) <= 10, ]

0

user3471881 08 Sep 14 at 13:14

source to share

In this script, the user announces the company. The script then specifies the top 10 maximum results (idem for the minimum values).

Result=NULL
A <- read.table(/your-file.txt",header=T,sep="\t",na.string="NA")
Company<-A$Company=="A" #can be A, C, P or other values

Subarea<-unique(A$Subarea)

for (i in 1:length(unique(A$Subarea)))
{Result[i]<-max(A$Result[Company & A$Subarea_code==Subarea[i]])}
Res1<-t((rbind(Subarea,Result)))
Res2<-Res1[order(-Res1[,2]),]
Res2[1:10,]

0

Ruthger Righart 08 Sep 14 at 15:23

source to share

zx8754 · Accepted Answer · 2014-09-08T10:02:06+0000

Package usage sqldf

:

df <- read.table(text="Area_code  Subarea_code  Company   Result
10         101           A         15
10         101           P         10
10         101           C         4
10         102           A         10
10         102           P         8
10         102           C         5
11         111           A         15
11         111           P         20
11         111           C         5
11         112           A         10
11         112           P         5
11         112           C         10", header=TRUE)

library(sqldf)
mymax <- sqldf("select Company,
                  Area_code,
                  max(Subarea_code) Largest_subarea_code
                  from df
                  group by Company,Area_code")
mymaxres <- sqldf("select d.Company,
                          d.Area_code,
                          m.Largest_subarea_code,
                          d.Result
                  from df d, mymax m
                  where d.Company=m.Company and
                        d.Subarea_code=m.Largest_subarea_code")

mymin <- sqldf("select Company,
                  Area_code,
                  min(Subarea_code) Smallest_subarea_code
                  from df
                  group by Company,Area_code")
myminres <- sqldf("select d.Company,
                          d.Area_code,
                          m.Smallest_subarea_code,
                          d.Result
                  from df d, mymin m
                  where d.Company=m.Company and
                        d.Subarea_code=m.Smallest_subarea_code")
result <- sqldf("select a.*, b.Smallest_subarea_code,b.Result
                from mymaxres a, myminres b
                where a.Company=b.Company and 
                      a.Area_code=b.Area_code")

Grouping a query into a group and a subgroup

More articles: