Multiple column subset on .N in large data.table
I have a large data.table,
one that contains millions of rows and 30 columns. The columns contain varying numbers of categorical values. I would like to blank out any value (level) that occurs less than a certain fraction of the time.
Here's an example:
dt <- data.table(id=1:18,col1=c(rep("a",5), rep("b",10), rep("c",3)), col2=c(rep("d",12),rep("e",5),"f"))
dt
    id col1 col2
 1:  1    a    d
 2:  2    a    d
 3:  3    a    d
 4:  4    a    d
 5:  5    a    d
 6:  6    b    d
 7:  7    b    d
 8:  8    b    d
 9:  9    b    d
10: 10    b    d
11: 11    b    d
12: 12    b    d
13: 13    b    e
14: 14    b    e
15: 15    b    e
16: 16    c    e
17: 17    c    e
18: 18    c    f
For example, I only want to keep levels that occur in more than 0.5 of the rows of a column:
> dt[,.N,by=col1][N/sum(N)>0.5]
col1 N
1: b 10
and
> dt[,.N,by=col2][N/sum(N)>0.5]
col2 N
1: d 12
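For reference, the two per-column queries above can be collected in one pass with lapply (a minimal sketch; the name keep is mine, not part of the question):

```r
library(data.table)

dt <- data.table(id = 1:18,
                 col1 = c(rep("a", 5), rep("b", 10), rep("c", 3)),
                 col2 = c(rep("d", 12), rep("e", 5), "f"))

# for each column, the set of levels covering more than half of the rows
keep <- lapply(c("col1", "col2"), function(col)
    dt[, .N, by = col][N / sum(N) > 0.5][[1L]])
```

Here keep is a list with one character vector per column ("b" for col1, "d" for col2).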
What I came up with was to iterate over the columns and use %in%:

for (i in 1:2) dt[, paste0('newcol', i) := lapply(dt[[paste0('col', i)]],
    function(y) ifelse(y %in% dt[, .N, by = dt[[paste0('col', i)]]][N/sum(N) > 0.5][[1]], y, ""))]
and then create a new column that contains the merged values:

dt[, merge := paste(newcol1, newcol2), by = id]

which gets me my desired result in the merge column:
> dt
    id col1 col2 newcol1 newcol2 merge
 1:  1    a    d               d     d
 2:  2    a    d               d     d
 3:  3    a    d               d     d
 4:  4    a    d               d     d
 5:  5    a    d               d     d
 6:  6    b    d       b       d   b d
 7:  7    b    d       b       d   b d
 8:  8    b    d       b       d   b d
 9:  9    b    d       b       d   b d
10: 10    b    d       b       d   b d
11: 11    b    d       b       d   b d
12: 12    b    d       b       d   b d
13: 13    b    e       b           b
14: 14    b    e       b           b
15: 15    b    e       b           b
16: 16    c    e
17: 17    c    e
18: 18    c    f
The problem is that this is really slow on a large dataset. I suspect I'm not doing this the "data.table way". I also have to be very careful not to copy the original dataset, because it barely fits in my RAM, which is why data.table was so attractive in the first place. I don't mind intermediate steps as long as the whole process is faster.
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.2
loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.2 reshape2_1.4 stringr_0.6.2 tools_3.0.2
Hope I've got it right this time. Too much time wasted on wrong answers!
cols = paste("col", 1:2, sep = "")
rows = nrow(dt)
for (i in seq_along(cols)) {
    # groups whose share of rows is <= 0.5 get blanked;
    # groups failing the condition return NULL, so := leaves them unchanged
    dt[, (cols[i]) := if (.N / rows <= .5) "", by = c(cols[i])]
}
# paste the value columns (dropping id) into one merged column
dt[, merge := do.call(paste, c(as.list(dt)[-1L], sep = " "))]
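To see why this works on the example data: because the if has no else branch, groups failing the condition return NULL, and a grouped := skips those rows, so no per-row %in% lookup is needed (a small check, rerun on a fresh copy of the example dt):

```r
library(data.table)

dt <- data.table(id = 1:18,
                 col1 = c(rep("a", 5), rep("b", 10), rep("c", 3)),
                 col2 = c(rep("d", 12), rep("e", 5), "f"))

cols <- paste("col", 1:2, sep = "")
rows <- nrow(dt)
for (i in seq_along(cols)) {
    # groups with share <= 0.5 are overwritten with "";
    # the others return NULL and keep their original values
    dt[, (cols[i]) := if (.N / rows <= .5) "", by = c(cols[i])]
}
# paste the value columns (dropping id) into one merged column
dt[, merge := do.call(paste, c(as.list(dt)[-1L], sep = " "))]
```

Only "b" (10/18 of col1) and "d" (12/18 of col2) survive, so merge holds " d" for rows 1-5, "b d" for rows 6-12, "b " for rows 13-15, and " " for the rest.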
Here's a benchmark on 1e6 rows * 30 columns:
set.seed(1L)
dt = setDT(lapply(1:30, function(x) sample(letters[1:4], 1e6, TRUE)))
system.time({
    cols = paste("V", 1:30, sep = "")
    rows = nrow(dt)
    for (i in seq_along(cols)) {
        dt[, (cols[i]) := if (.N / rows <= .5) "", by = c(cols[i])]
    }
    # no id column in this benchmark data, so paste all 30 columns
    dt[, merge := do.call(paste, c(as.list(dt), sep = " "))]
})
# user system elapsed
# 4.880 0.086 5.095
I'll let you pick up possible further optimizations. Good luck!
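One possible direction (a sketch only, not benchmarked at scale): compute each column's frequent levels once, then blank the remaining rows by reference with set() and data.table's fast character match %chin%, which also avoids copying the table:

```r
library(data.table)

set.seed(1L)
# smaller data for illustration: 1e5 rows, 30 columns
dt <- setDT(lapply(1:30, function(x) sample(letters[1:4], 1e5, TRUE)))

rows <- nrow(dt)
for (col in names(dt)) {
    # levels covering more than half of the rows (computed once per column)
    keep <- dt[, .N, by = col][N / rows > 0.5][[1L]]
    # blank everything else in place, without copying the table
    set(dt, i = which(!dt[[col]] %chin% keep), j = col, value = "")
}
```

With four roughly equally likely letters per column, no level exceeds the 0.5 threshold here, so every column ends up blank; on real skewed data the dominant levels would survive.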