Summation of the levels of multiple factorial variables
I have searched for similar questions but cannot find the exact solution. This question is somewhat similar, but only deals with the question of summing several continuous variables, not factors.
I have dataframe, consisting of a 4-factor variables ( sex
, agegroup
, hiv
, group
), for example
set.seed(20150710)
df<-data.frame(sex=as.factor(c(sample(1:2, 10000, replace=T))),
agegroup=as.factor(c(sample(1:5,10000, replace=T))),
hiv=as.factor(c(sample(1:3,10000, replace=T))),
group=as.factor(c(sample(1:2,10000, replace=T)))
)
levels(df$sex)<- c("Male", "Female")
levels(df$agegroup)<- c("16-24", "25-34", "35-44", "45-54", "55+")
levels(df$hiv)<-c("Positive", "Negative", "Not tested")
levels(df$group)<-c("Intervention", "Control")
I would like to create a pivot table giving counts and proportions for each level of exposure variables sex
, agegroup
and hiv
, stratified group
.
EDIT: This is what I was aiming for:
X N_Control Percent_Control N_Intervention Percent_Intervention
1 sex_Female 2517 0.5041057 2480 0.4953066
2 sex_Male 2476 0.4958943 2527 0.5046934
3 agegroup_16-24 1005 0.2012818 992 0.1981226
4 agegroup_25-34 1001 0.2004807 996 0.1989215
5 agegroup_35-44 1010 0.2022832 997 0.1991212
6 agegroup_45-54 976 0.1954737 996 0.1989215
7 agegroup_55+ 1001 0.2004807 1026 0.2049131
8 hiv_Negative 1679 0.3362708 1642 0.3279409
9 hiv_Not tested 1633 0.3270579 1660 0.3315359
10 hiv_Positive 1681 0.3366713 1705 0.3405233
But I cannot get it to work with summarise_each
dplyr; only general values ββand proportions of variables are given, and not for each level of factors:
df.out<-df %>%
group_by(group) %>%
summarise_each(funs(N=n(), Percent=n()/sum(n())), sex, agegroup, hiv)
print(df.out)
group sex_N agegroup_N hiv_N sex_Percent agegroup_Percent hiv_Percent
1 1 4973 4973 4973 1 1 1
2 2 5027 5027 5027 1 1 1
Finally, is there a way to modify the table (using tidyr for example) so that exposure variables (gender, age group, hiv) are represented as strings?
thank
source to share
Doing this in two steps will give you the desired result. Calculate first n
, then calculate the percentage by group
:
library(dplyr)
df.out <- df %>%
group_by(group, sex, agegroup, hiv) %>%
tally() %>%
group_by(group) %>%
mutate(percent=n/sum(n))
Solution with data.table
:
library(data.table)
dt.out <- setDT(df)[, .N, by=.(group, sex, agegroup, hiv)][, percent:=N/sum(N), by=group]
library(microbenchmark)
microbenchmark(df.out = df %>%
group_by(group, sex, agegroup, hiv) %>%
tally() %>%
group_by(group) %>%
mutate(percent=n/sum(n)),
dt.out = df[,.N,by=.(group, sex, agegroup, hiv)][,percent:=N/sum(N),by=group])
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# df.out 8.299870 8.518590 8.894504 8.708315 8.931459 11.964930 100 b
# dt.out 2.346632 2.394788 2.540132 2.441777 2.551235 4.344442 100 a
Conclusion: The solution data.table
is much faster (3.5x).
To get the table you requested after editing your question, you can do the following:
library(data.table)
setDT(df)
dt.sex <- dcast(df[,.N, by=.(sex,group)][,percent:=N/sum(N)], sex ~ group, value.var = c("N", "percent"))
dt.age <- dcast(df[,.N, by=.(agegroup,group)][,percent:=N/sum(N)], agegroup ~ group, value.var = c("N", "percent"))
dt.hiv <- dcast(df[,.N, by=.(hiv,group)][,percent:=N/sum(N)], hiv ~ group, value.var = c("N", "percent"))
dt.out.wide <- rbindlist(list(dt.sex, dt.age, dt.hiv), use.names=FALSE)
names(dt.out.wide) <- c("X","N_Intervention","N_Control","percent_Intervention","percent_Control")
this gives:
> dt.out.wide
X N_Intervention N_Control percent_Intervention percent_Control
1: Male 2454 2488 0.2454 0.2488
2: Female 2561 2497 0.2561 0.2497
3: 16-24 954 991 0.0954 0.0991
4: 25-34 1033 1002 0.1033 0.1002
5: 35-44 1051 1000 0.1051 0.1000
6: 45-54 983 978 0.0983 0.0978
7: 55+ 994 1014 0.0994 0.1014
8: Positive 1717 1664 0.1717 0.1664
9: Negative 1637 1659 0.1637 0.1659
10: Not tested 1661 1662 0.1661 0.1662
source to share