Grouping a query into a group and a subgroup
I want to group my data using SQL or R to get the top or bottom 10 Subarea_codes
for each Company
and Area_code
. Essentially: Subarea_codes
inside Area_codes
, where everyone Company
has the largest or smallest score.
data.csv
Area_code Subarea_code Company Result
10 101 A 15
10 101 P 10
10 101 C 4
10 102 A 10
10 102 P 8
10 102 C 5
11 111 A 15
11 111 P 20
11 111 C 5
11 112 A 10
11 112 P 5
11 112 C 10
result.csv should be like this
Company Area_code Largest_subarea_code Result Smallest_subarea_code Result
A 10 101 15 102 10
P 10 101 10 102 8
C 10 102 5 101 4
A 11 111 15 112 10
P 11 111 20 112 5
C 11 112 10 111 5
There Area_code
could be hundreds inside each Subarea_codes
, but I only want top and bottom 10 for each company.
Also, it should not be allowed in one request, but it can be split into two requests, i.e. the smallest is represented in results_10_smallest and the largest in result_10_largest. But I hope I can accomplish this with one query for each result.
What I have tried:
SELECT Company, Area_code, Subarea_code MAX(Result)
AS Max_result
FROM data
GROUP BY Subarea_code
ORDER BY Company
;
This gives me everything Companies
with the highest results in every Subarea_code. This would mean: A, A, P, AC for the data above.
source to share
Package usage sqldf
:
df <- read.table(text="Area_code Subarea_code Company Result
10 101 A 15
10 101 P 10
10 101 C 4
10 102 A 10
10 102 P 8
10 102 C 5
11 111 A 15
11 111 P 20
11 111 C 5
11 112 A 10
11 112 P 5
11 112 C 10", header=TRUE)
library(sqldf)
mymax <- sqldf("select Company,
Area_code,
max(Subarea_code) Largest_subarea_code
from df
group by Company,Area_code")
mymaxres <- sqldf("select d.Company,
d.Area_code,
m.Largest_subarea_code,
d.Result
from df d, mymax m
where d.Company=m.Company and
d.Subarea_code=m.Largest_subarea_code")
mymin <- sqldf("select Company,
Area_code,
min(Subarea_code) Smallest_subarea_code
from df
group by Company,Area_code")
myminres <- sqldf("select d.Company,
d.Area_code,
m.Smallest_subarea_code,
d.Result
from df d, mymin m
where d.Company=m.Company and
d.Subarea_code=m.Smallest_subarea_code")
result <- sqldf("select a.*, b.Smallest_subarea_code,b.Result
from mymaxres a, myminres b
where a.Company=b.Company and
a.Area_code=b.Area_code")
source to share
If you're already doing this in R, why not use a much more efficient one data.table
instead of sqldf
using SQL syntax? Assuming that data
is your dataset, simply:
library(data.table)
setDT(data)[, list(Largest_subarea_code = Subarea_code[which.max(Result)],
Resultmax = max(Result),
Smallest_subarea_code = Subarea_code[which.min(Result)],
Resultmin = min(Result)), by = list(Company, Area_code)]
# Company Area_code Largest_subarea_code Resultmax Smallest_subarea_code Resultmin
# 1: A 10 101 15 102 10
# 2: P 10 101 10 102 8
# 3: C 10 102 5 101 4
# 4: A 11 111 15 112 10
# 5: P 11 111 20 112 5
# 6: C 11 112 10 111 5
source to share
There seems to be a discrepancy between the result shown and the description. The description asks for 10th and bottom 10 results for each Area code / company, but the sample output only shows the top 1 and bottom 1. For example, for Region Code 10 and Subarea 101 of Company A, the top is 15 and Subarea 102 is the second a value with a result of 10, so as described there should be two rows for this company / area code combination. (If there were more data, there would be up to 10 rows for this combination of company and region combinations.)
We give two answers. The first assumes that the top 10 and bottom 10 are required for each company and area code, as in the question description, and the second assumes only the top and bottom for each company and area code, as in the output of the question.
1) Top / Bottom 10
Here we assume that the 10th and bottom 10 results for each company / area code are required. If it's just top and bottom then see (2) later (or replace 10 with 1 in the code here). Bottom10
are all rows for which there are 10 or fewer subareas for the same area and company code with equal or less results. Top10
is similar.
library(sqldf)
Bottom10 <- sqldf("select a.Company,
a.Area_code,
a.Subarea_code Bottom_Subarea,
a.Result Bottom_Result,
count(*) Bottom_Rank
from df a join df b
on a.Company = b.Company and
a.Area_code = B.Area_code and
b.Result <= a.Result
group by a.Company, a.Area_code, a.Subarea_code
having count(*) <= 10")
Top10 <- sqldf("select a.Company,
a.Area_code,
a.Subarea_code Top_Subarea,
a.Result Top_Result,
count(*) Top_Rank
from df a join df b
on a.Company = b.Company and
a.Area_code = B.Area_code and
b.Result >= a.Result
group by a.Company, a.Area_code, a.Subarea_code
having count(*) <= 10")
The description states that you want to get the top 10 or bottom 10 for each company / region code, in which case just use one of the above results. If you would like to combine them, we show the merge below. We added a Rank column to indicate the lowest / highest (Rank is 1), the second lowest / highest (Rank is 2), etc.
sqldf("select t.Area_code,
t.Company,
t.Top_Rank Rank,
t.Top_Subarea,
t.Top_Result,
b.Bottom_Subarea,
b.Bottom_Result
from Bottom10 b join Top10 t
on t.Area_code = b.Area_code and
t.Company = b.Company and
t.Top_Rank = b.Bottom_Rank
order by t.Area_code, t.Company, t.Top_Rank")
giving:
Area_code Company Rank Top_Subarea Top_Result Bottom_Subarea Bottom_Result
1 10 A 1 101 15 102 10
2 10 A 2 102 10 101 15
3 10 C 1 102 5 101 4
4 10 C 2 101 4 102 5
5 10 P 1 101 10 102 8
6 10 P 2 102 8 101 10
7 11 A 1 111 15 112 10
8 11 A 2 112 10 111 15
9 11 C 1 112 10 111 5
10 11 C 2 111 5 112 10
11 11 P 1 111 20 112 5
12 11 P 2 112 5 111 20
Please note that this format has less sense if there are connections and can actually generate more than 10 lines of code Company / Area, so you may just want to use the individual Top10
, and Bottom10
in this case, you can also consider the jitter df$Result
, if this is the problem:
df$Result <- jitter(df$Result)
# now perform SQL statements
2) Only top / bottom
Here we only list the top and bottom results and the corresponding subareas for each company / area code. Note that this uses an extension for SQL supported by sqlite and the SQL is significantly simpler:
Bottom1 <- sqldf("select Company,
Area_code,
Subarea_code Bottom_Subarea,
min(Result) Bottom_Result
from df
group by Company, Area_code")
Top1 <- sqldf("select Company,
Area_code,
Subarea_code Top_Subarea,
max(Result) Top_Result
from df
group by Company, Area_code")
sqldf("select a.Company,
a.Area_code,
Top_Subarea,
Top_Result,
Bottom_Subarea
Bottom_Result
from Top1 a join Bottom1 b
on a.Company = b.Company and
a.Area_code = b.Area_code
order by a.Area_code, a.Company")
This gives:
Company Area_code Top_Subarea Top_Result Bottom_Result
1 A 10 101 15 102
2 C 10 102 5 101
3 P 10 101 10 102
4 A 11 111 15 112
5 C 11 112 10 111
6 P 11 111 20 112
Update Hotfix and added (2).
source to share
In this script, the user announces the company. The script then specifies the top 10 maximum results (idem for the minimum values).
Result=NULL
A <- read.table(/your-file.txt",header=T,sep="\t",na.string="NA")
Company<-A$Company=="A" #can be A, C, P or other values
Subarea<-unique(A$Subarea)
for (i in 1:length(unique(A$Subarea)))
{Result[i]<-max(A$Result[Company & A$Subarea_code==Subarea[i]])}
Res1<-t((rbind(Subarea,Result)))
Res2<-Res1[order(-Res1[,2]),]
Res2[1:10,]
source to share