Get frequency across columns in R

I have data in a dataframe in this format:

  grp1 grp2 grp3 grp4 result
1    0    1    0    0      1
2    1    0    0    0      0
3    0    0    0    1      1
4    0    0    0    1      1
5    1    0    0    0      0
6    0    1    0    0      1
.
.
.

      

What can be generated with

set.seed(13)

groups <- c("grp1", "grp2", "grp3", "grp4", "result")

# Randomly assign each to group and a result
x <- do.call(rbind, lapply(1:50, function(x) c(sample(c(1,0,0,0), 4), sample(0:1, 1))))
df <- data.frame(x)
colnames(df) <- groups

      

My goal is to format the data:

  group      freq
1  grp1 0.5625000
2  grp2 0.5000000
3  grp3 0.6250000
4  grp4 0.2857143

      

Where frequency is the percentage of each group that has a result.

My attempt so far with dplyr:

library(dplyr)

df %>% 
  group_by(grp1, grp2, grp3, grp4, result) %>% 
  summarize(n = n()) %>% 
  mutate(freq = n / sum(n)) %>%
  select(-n) %>%
  filter(result == 1)

      

leads to

  grp1 grp2 grp3 grp4 result      freq
1    0    0    0    1      1 0.5625000
2    0    0    1    0      1 0.5000000
3    0    1    0    0      1 0.6250000
4    1    0    0    0      1 0.2857143

      

+3


source to share


5 answers


"Frequency is the percentage of each group that has a result." I am assuming you mean the percentage of rows where each group is equal to the result.



df %>% 
    tidyr::gather(key = group, value = group_choice, grp1:grp4) %>%
    group_by(group) %>%
    filter(group_choice == 1) %>%
    summarize(freq = mean(group_choice == result))

# Source: local data frame [4 x 2]
# 
#   group      freq
# 1  grp1 0.2857143
# 2  grp2 0.6250000
# 3  grp3 0.5000000
# 4  grp4 0.5625000

      

+4


source


Here's a data.table

try



library(data.table)
melt(setDT(df), "result")[, .(freq = sum(value[result == 1])/sum(value)), by = variable]
#    variable      freq
# 1:     grp1 0.2857143
# 2:     grp2 0.6250000
# 3:     grp3 0.5000000
# 4:     grp4 0.5625000

      

+6


source


You can also use apply

:

> freq=apply(df,2,function(x){sum(x==1 & df$result==1)/sum(x)})
> data.frame(freq)
#             freq
# grp1   0.2857143
# grp2   0.6250000
# grp3   0.5000000
# grp4   0.5625000
# result 1.0000000

      

As suggested by @akrun you can also:

summarise_each(df,funs( sum(.==1 & df$result==1)/sum(.))) %>% t()

      

In this case, apply

seems to be the fastest solution:

akrun=function(df)    {summarise_each(df,funs( sum(.==1 & df$result==1)/sum(.))) %>% t()}
user7598=function(df) {apply(df,2,function(x){sum(x==1 & df$result==1)/sum(x)})}
David=function(df)    {melt(setDT(df), "result")[, .(freq = sum(value[result == 1])/sum(value)), by = variable]}
Gregor=function(df)   {df %>% tidyr::gather(key = group, value = group_choice, grp1:grp4) %>% group_by(group) %>% filter(group_choice == 1) %>% summarize(freq = mean(group_choice == result))}

# SPEED TESTS
set.seed(5)
microbenchmark(akrun(df), Gregor(df),user7598(df),David(df))
Unit: microseconds
         expr       min         lq       mean    median         uq       max neval cld
    akrun(df)  9645.860 10509.3940 12690.5538 10848.248 12315.4020 98239.948   100   c
   Gregor(df) 10319.888 11405.6060 12512.9027 11685.120 12237.1120 26211.999   100   c
 user7598(df)   423.662   491.7045   630.8143   563.958   629.8315  2027.243   100   a  
    David(df)  2115.610  2273.5525  2622.7699  2348.005  2475.2295 15491.534   100   b 

      

Check out the changes based on the OP's comments in @ Gregor's answer.

+3


source


If I understand correctly, you want to know the percentage of each group with "1", assuming "Result" is 1. If so, you can use the apply () function to sum the columns, then divide them by the column length. You can apply a constraint that "Result" must be 1 by specifying in the data frame.

Note that in the dataframe specification below, I say that R should only be applied to the first four columns, as the Result column does not need to be included in the frequency calculation.

namely: df [conditional operator, c (1: 4)]

result <- data.frame(apply(df[df$result == 1, c(1:4)], 2,sum)/apply(df[df$result==1, c(1:4) ], 2, length))
colnames(result)<- c("freq")

      

This produces results in the format below

      freq
grp1 0.1818182
grp2 0.1818182
grp3 0.3636364
grp4 0.2727273

      

0


source


I think colSums()

here:

rci <- which(names(df)=='result');
data.frame(group=names(df[-rci]),freq=unname(colSums(df[-rci]==1&df[,rci]==1)/colSums(df[-rci])));
##   group      freq
## 1  grp1 0.2857143
## 2  grp2 0.6250000
## 3  grp3 0.5000000
## 4  grp4 0.5625000

      

0


source







All Articles