Calculation on multiple columns and aggregates using multiple factors
My data looks like this:
df <- data.frame(Price=seq(1, 1.5, 0.1),
Sales=seq(6, 1, -1),
Quality=c('A','A','A','B','B','B'),
Brand=c('F','P','P','P','F','F'))
Sometimes I need to do a complex calculation on multiple columns and aggregate over multiple factors. As a simplified example, to get the distribution of Revenue (= Price * Sales) within each Quality, split by Brand, I would do:
df$Revenue <- df$Price*df$Sales
RevSumByQ <- aggregate(Revenue~Quality, data=df, sum)
colnames(RevSumByQ)[2] <- "RevSumByQ"
df <- merge(df, RevSumByQ)
RevSumWithinQByB <- aggregate(RevSumByQ~Brand, data=df, sum)
colnames(RevSumWithinQByB)[2] <- "RevSumWithinQByB"
df <- merge(df, RevSumWithinQByB)
df$RevDistWithinQByB = df$RevSumByQ/df$RevSumWithinQByB
df
  Brand Quality Price Sales Revenue RevSumByQ RevSumWithinQByB RevDistWithinQByB
1     F       A   1.0     6     6.0      16.3             32.7         0.4984709
2     F       B   1.4     2     2.8       8.2             32.7         0.2507645
3     F       B   1.5     1     1.5       8.2             32.7         0.2507645
4     P       A   1.1     5     5.5      16.3             40.8         0.3995098
5     P       A   1.2     4     4.8      16.3             40.8         0.3995098
6     P       B   1.3     3     3.9       8.2             40.8         0.2009804
Shown as a plot:
require(ggplot2)
ggplot(data=df, aes(x=Brand, y=RevDistWithinQByB, fill=Quality)) + geom_bar(stat='identity')
There must be a better way to draw this plot, but my main interest is in getting the data frame with fewer intermediate results (Revenue, RevSumByQ, RevSumWithinQByB). I see a repeating pattern in my approach, so I wonder whether there is a more elegant solution, or some function that makes this task easier.
Here is a data.table approach:
library(data.table)
setDT(df)
##
df[,Revenue:=Price*Sales][
,RevSumByQ:=sum(Revenue),
by=Quality][
,RevSumWithinQByB:=sum(RevSumByQ),
by=Brand][
,RevDistWithinQByB:=RevSumByQ/RevSumWithinQByB]
And while I don't usually do this myself, you can call your ggplot code from within the same chain:
df[,Revenue:=Price*Sales][
,RevSumByQ:=sum(Revenue),
by=Quality][
,RevSumWithinQByB:=sum(RevSumByQ),
by=Brand][
,RevDistWithinQByB:=RevSumByQ/RevSumWithinQByB][
,{print(ggplot(
data=.SD,
aes(x=Brand,
y=RevDistWithinQByB,
fill=Quality))+
geom_bar(stat="identity"))}]
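Since the question asks for fewer intermediate columns, the chain above could probably be shortened. The following is only a sketch under a few assumptions: it rebuilds the question's data as a fresh data.table named dt (a name chosen here for illustration), computes Revenue inline instead of storing it, divides by the per-Brand sum directly, and drops the one remaining helper column with := NULL at the end.

```r
library(data.table)

# Same data as in the question, built directly as a data.table.
# 'dt' is a name chosen for this sketch only.
dt <- data.table(Price   = seq(1, 1.5, 0.1),
                 Sales   = seq(6, 1, -1),
                 Quality = c('A','A','A','B','B','B'),
                 Brand   = c('F','P','P','P','F','F'))

# 1. Revenue total per Quality, computed inline (no Revenue column is stored)
# 2. Share of that total within each Brand
# 3. Drop the helper column once it is no longer needed
dt[, RevSumByQ := sum(Price * Sales), by = Quality][
   , RevDistWithinQByB := RevSumByQ / sum(RevSumByQ), by = Brand][
   , RevSumByQ := NULL]

dt
```

Only RevDistWithinQByB is added to the original four columns. Note that the plot in the question uses Quality as the fill, so nothing else from the intermediate steps is needed for it.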
Basically (as @arun pointed out), you don't need the merges here; you can do everything with ave from base R. It seems hard to skip the first two aggregation steps, but you can skip the last calculation and put it directly in the ggplot call. Something like:
df$Revenue <- df$Price*df$Sales
df$RevSumByQ <- with(df, ave(Revenue, Quality, FUN = sum))
df$RevSumWithinQByB <- with(df, ave(RevSumByQ, Brand, FUN = sum))
require(ggplot2)
ggplot(data = df,
aes(x = Brand, y = RevSumByQ/RevSumWithinQByB, fill = Quality)) +
geom_bar(stat = 'identity')
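If the goal is to keep the helper columns out of the data frame altogether, one possible variation (a sketch, not a claim that it is the better way) is to do the two ave steps inside a single with() block. Variables assigned inside with() live only in its temporary evaluation environment, so rev_by_q (a name made up for this sketch) never becomes a column of df.

```r
# Same data as in the question
df <- data.frame(Price   = seq(1, 1.5, 0.1),
                 Sales   = seq(6, 1, -1),
                 Quality = c('A','A','A','B','B','B'),
                 Brand   = c('F','P','P','P','F','F'))

# The braces return the value of their last expression;
# rev_by_q is local to the with() call and is not added to df.
df$RevDistWithinQByB <- with(df, {
  rev_by_q <- ave(Price * Sales, Quality, FUN = sum)  # revenue total per Quality
  rev_by_q / ave(rev_by_q, Brand, FUN = sum)          # share within each Brand
})

df
```

Because ave returns a vector aligned with the input rows, the original row order is preserved, unlike the merge-based version in the question, which reorders rows by the merge key.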