Get frequency across columns in R
I have data in a dataframe in this format:
grp1 grp2 grp3 grp4 result
1 0 1 0 0 1
2 1 0 0 0 0
3 0 0 0 1 1
4 0 0 0 1 1
5 1 0 0 0 0
6 0 1 0 0 1
.
.
.
which can be generated with:
set.seed(13)
groups <- c("grp1", "grp2", "grp3", "grp4", "result")
# Randomly assign each row to a group and a result
x <- do.call(rbind, lapply(1:50, function(x) c(sample(c(1,0,0,0), 4), sample(0:1, 1))))
df <- data.frame(x)
colnames(df) <- groups
My goal is to transform the data into this format:
group freq
1 grp1 0.5625000
2 grp2 0.5000000
3 grp3 0.6250000
4 grp4 0.2857143
Where frequency is the percentage of each group that has a result.
My attempt so far with dplyr:
library(dplyr)
df %>%
  group_by(grp1, grp2, grp3, grp4, result) %>%
  summarize(n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  select(-n) %>%
  filter(result == 1)
leads to
grp1 grp2 grp3 grp4 result freq
1 0 0 0 1 1 0.5625000
2 0 0 1 0 1 0.5000000
3 0 1 0 0 1 0.6250000
4 1 0 0 0 1 0.2857143
"Frequency is the percentage of each group that has a result." I am assuming you mean the percentage of rows where each group is equal to the result.
df %>%
  tidyr::gather(key = group, value = group_choice, grp1:grp4) %>%
  group_by(group) %>%
  filter(group_choice == 1) %>%
  summarize(freq = mean(group_choice == result))
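Note that `gather()` has since been superseded in tidyr; the same computation can be sketched with `tidyr::pivot_longer()` (assuming tidyr >= 1.0 is available):

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
set.seed(13)
groups <- c("grp1", "grp2", "grp3", "grp4", "result")
df <- data.frame(do.call(rbind, lapply(1:50, function(i)
  c(sample(c(1, 0, 0, 0), 4), sample(0:1, 1)))))
colnames(df) <- groups

# Same logic as the gather() pipeline, using pivot_longer()
df %>%
  pivot_longer(grp1:grp4, names_to = "group", values_to = "group_choice") %>%
  group_by(group) %>%
  filter(group_choice == 1) %>%
  summarize(freq = mean(group_choice == result))
```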
# Source: local data frame [4 x 2]
#
# group freq
# 1 grp1 0.2857143
# 2 grp2 0.6250000
# 3 grp3 0.5000000
# 4 grp4 0.5625000
You can also use apply():
> freq <- apply(df, 2, function(x) sum(x == 1 & df$result == 1) / sum(x))
> data.frame(freq)
# freq
# grp1 0.2857143
# grp2 0.6250000
# grp3 0.5000000
# grp4 0.5625000
# result 1.0000000
As suggested by @akrun, you can also do:
summarise_each(df, funs(sum(. == 1 & df$result == 1) / sum(.))) %>% t()
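`summarise_each()` is deprecated in current dplyr; a sketch of the same column-wise calculation using `across()` (assuming dplyr >= 1.0):

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(13)
groups <- c("grp1", "grp2", "grp3", "grp4", "result")
df <- data.frame(do.call(rbind, lapply(1:50, function(i)
  c(sample(c(1, 0, 0, 0), 4), sample(0:1, 1)))))
colnames(df) <- groups

# Column-wise version of the summarise_each() call using across()
df %>%
  summarise(across(everything(), ~ sum(.x == 1 & df$result == 1) / sum(.x))) %>%
  t()
```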
In this case, apply() seems to be the fastest solution:
library(microbenchmark)
library(data.table)  # for setDT() and melt() in David's version

akrun <- function(df) summarise_each(df, funs(sum(. == 1 & df$result == 1) / sum(.))) %>% t()
user7598 <- function(df) apply(df, 2, function(x) sum(x == 1 & df$result == 1) / sum(x))
David <- function(df) melt(setDT(df), "result")[, .(freq = sum(value[result == 1]) / sum(value)), by = variable]
Gregor <- function(df) df %>%
  tidyr::gather(key = group, value = group_choice, grp1:grp4) %>%
  group_by(group) %>%
  filter(group_choice == 1) %>%
  summarize(freq = mean(group_choice == result))
# SPEED TESTS
set.seed(5)
microbenchmark(akrun(df), Gregor(df),user7598(df),David(df))
Unit: microseconds
expr min lq mean median uq max neval cld
akrun(df) 9645.860 10509.3940 12690.5538 10848.248 12315.4020 98239.948 100 c
Gregor(df) 10319.888 11405.6060 12512.9027 11685.120 12237.1120 26211.999 100 c
user7598(df) 423.662 491.7045 630.8143 563.958 629.8315 2027.243 100 a
David(df) 2115.610 2273.5525 2622.7699 2348.005 2475.2295 15491.534 100 b
See the updates based on the OP's comments in @Gregor's answer.
If I understand correctly, you want the percentage of each group that has a "1", given that result is 1. If so, you can use the apply() function to sum the columns and then divide by the column length. The constraint that result must be 1 is imposed by subsetting the data frame.
Note that in the subset below, apply() is run only over the first four columns, since the result column should not be included in the frequency calculation.
That is: df[condition, c(1:4)]
result <- data.frame(apply(df[df$result == 1, c(1:4)], 2, sum) / apply(df[df$result == 1, c(1:4)], 2, length))
colnames(result) <- c("freq")
This produces results in the following format:
freq
grp1 0.1818182
grp2 0.1818182
grp3 0.3636364
grp4 0.2727273
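Since the group columns are 0/1 indicators, the sum-divided-by-length above is just a column mean; a minimal base-R sketch of the same calculation with colMeans():

```r
# Rebuild the example data from the question
set.seed(13)
groups <- c("grp1", "grp2", "grp3", "grp4", "result")
df <- data.frame(do.call(rbind, lapply(1:50, function(i)
  c(sample(c(1, 0, 0, 0), 4), sample(0:1, 1)))))
colnames(df) <- groups

# colMeans() on the rows where result == 1 gives sum/length per column directly
result <- data.frame(freq = colMeans(df[df$result == 1, c(1:4)]))
result
```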