Converting large long-format data to wide format in R
I need help converting my 1558810 x 84 long data to 1558810 x 4784 wide data.
Let me explain how and why. My original data has three main columns:
id empId dept
1 a social
2 a Hist
3 a math
4 b comp
5 a social
6 b comp
7 c math
8 c Hist
9 b math
10 a comp
id is a unique key that records which employee went to which department at the university on a given day. I need to transform the data as shown below:
id empId dept social Hist math comp
1 a social 1 0 0 0
2 a Hist 0 1 0 0
3 a math 0 0 1 0
4 b comp 0 0 0 1
5 a social 1 0 0 0
6 b comp 0 0 0 1
7 c math 0 0 1 0
8 c Hist 0 1 0 0
9 b math 0 0 1 0
10 a comp 0 0 0 1
I have two datasets, one with 49k rows and one with 1.55 million rows. For the smaller dataset, which has 1100 unique department values, I used dcast from the reshape2 package to get the result I wanted (so the converted data has 3 + 1100 columns and 49k rows). But when I use the same function on the larger dataset, which has 4700 unique department values, R crashes with a memory error. I tried alternatives such as xtabs and reshape, but each one failed with a memory error as well.
Now I have resorted to a crude for loop:
columns <- unique(ds$dept)
for(i in 1:length(unique(ds$dept)))
{
ds[,columns[i]] <- ifelse(ds$dept==columns[i],1,0)
}
But this is very slow and the code has been running for 10 hours. Is there a workaround for this that I'm missing?
ANY suggestions would be very helpful!
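For a sense of scale, here is a rough sketch of why dense approaches run out of memory, together with a vectorised base R way to fill the indicator columns without the loop. The row and department counts are taken from the question; the names n_rows, n_dept, ind and wide are made up for illustration, and the toy ds stands in for the real data. At full size the dense result works out to roughly 27 GiB as integers (about twice that as doubles), so this only shows the technique on the small example.

```r
# Toy data from the question (the real data has about 1.56 million rows).
ds <- data.frame(
  id    = 1:10,
  empId = c("a", "a", "a", "b", "a", "b", "c", "c", "b", "a"),
  dept  = c("social", "Hist", "math", "comp", "social",
            "comp", "math", "Hist", "math", "comp"),
  stringsAsFactors = FALSE
)

# Rough size of a dense indicator block at the question's scale.
n_rows <- 1558810
n_dept <- 4700
gib_double <- n_rows * n_dept * 8 / 2^30  # ifelse(..., 1, 0) yields doubles
gib_int    <- n_rows * n_dept * 4 / 2^30  # integer storage halves that

# Vectorised fill: one (row, column) index pair per observation,
# instead of one full-length ifelse() per department.
lv  <- unique(ds$dept)
ind <- matrix(0L, nrow = nrow(ds), ncol = length(lv),
              dimnames = list(NULL, lv))
ind[cbind(seq_len(nrow(ds)), match(ds$dept, lv))] <- 1L

wide <- cbind(ds, ind)
```

The matrix is filled in a single assignment, which avoids modifying the data frame once per department as the loop does.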
You may try
df$dept <- factor(df$dept, levels=unique(df$dept))
res <- cbind(df, model.matrix(~ 0+dept, df))
colnames(res) <- gsub("dept(?=[A-Za-z])", "", colnames(res), perl=TRUE)
res
# id empId dept social Hist math comp
#1 1 a social 1 0 0 0
#2 2 a Hist 0 1 0 0
#3 3 a math 0 0 1 0
#4 4 b comp 0 0 0 1
#5 5 a social 1 0 0 0
#6 6 b comp 0 0 0 1
#7 7 c math 0 0 1 0
#8 8 c Hist 0 1 0 0
#9 9 b math 0 0 1 0
#10 10 a comp 0 0 0 1
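If the dense result does not fit in memory at 1.55 million rows by 4700 departments, a sparse variant of the same idea may help. This is only a sketch, assuming the Matrix package (shipped with standard R installations) is acceptable and that the downstream code can work with a sparse matrix rather than a data.frame; the name sp is made up for illustration.

```r
library(Matrix)

# Same toy data as in this answer.
df <- data.frame(
  id    = 1:10,
  empId = c("a", "a", "a", "b", "a", "b", "c", "c", "b", "a"),
  dept  = c("social", "Hist", "math", "comp", "social",
            "comp", "math", "Hist", "math", "comp"),
  stringsAsFactors = FALSE
)

# Keep the departments in order of first appearance, as above.
df$dept <- factor(df$dept, levels = unique(df$dept))

# Sparse counterpart of model.matrix(): only the non-zero cells are stored,
# i.e. one entry per row of df.
sp <- sparse.model.matrix(~ 0 + dept, data = df)

# Drop the "dept" prefix that the formula interface adds to column names.
colnames(sp) <- sub("^dept", "", colnames(sp))
```

Note that binding sp back onto df with cbind would turn it into dense columns again, so the indicator block is best kept as a separate object whose rows line up with df.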
Or you can try
cbind(df, as.data.frame.matrix(table(df[,c(1,3)])))
Or using data.table
library(data.table)
setDT(df)
dcast.data.table(df, id + empId + dept ~ dept, fun=length)
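If the wide table itself fits in memory and the main problem is the speed of the loop, another option is to add the columns by reference with data.table. This is a sketch on the toy data, not a tested solution at full scale; the names dt and lv are made up for illustration, and the result is still dense, so it does not reduce the memory needed.

```r
library(data.table)

# Toy data from the question.
dt <- data.table(
  id    = 1:10,
  empId = c("a", "a", "a", "b", "a", "b", "c", "c", "b", "a"),
  dept  = c("social", "Hist", "math", "comp", "social",
            "comp", "math", "Hist", "math", "comp")
)

lv <- unique(dt$dept)

# Create every indicator column once, filled with integer zeros.
dt[, (lv) := 0L]

# Flip the matching rows to 1 in place; set() avoids the overhead of
# repeated [<-.data.frame calls and does not copy the table.
for (l in lv) {
  set(dt, i = which(dt$dept == l), j = l, value = 1L)
}
```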
Or using qdap
library(qdap)
cbind(df, as.wfm(with(df, mtabulate(setNames(dept, id)))))
Data:
df <- structure(list(id = 1:10, empId = c("a", "a", "a", "b", "a",
"b", "c", "c", "b", "a"), dept = c("social", "Hist", "math",
"comp", "social", "comp", "math", "Hist", "math", "comp")), .Names = c("id",
"empId", "dept"), class = "data.frame", row.names = c(NA, -10L))
Try:
> cbind(dd[1:3], dcast(dd, dd$id~dd$dept, length)[-1])
Using dept as value column: use value.var to override.
id empId dept comp Hist math social
1 1 a social 0 0 0 1
2 2 a Hist 0 1 0 0
3 3 a math 0 0 1 0
4 4 b comp 1 0 0 0
5 5 a social 0 0 0 1
6 6 b comp 1 0 0 0
7 7 c math 0 0 1 0
8 8 c Hist 0 1 0 0
9 9 b math 0 0 1 0
10 10 a comp 1 0 0 0
Data:
> dput(dd)
structure(list(id = 1:10, empId = structure(c(1L, 1L, 1L, 2L,
1L, 2L, 3L, 3L, 2L, 1L), .Label = c("a", "b", "c"), class = "factor"),
dept = structure(c(4L, 2L, 3L, 1L, 4L, 1L, 3L, 2L, 3L, 1L
), .Label = c("comp", "Hist", "math", "social"), class = "factor")), .Names = c("id",
"empId", "dept"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10"))