Fastest way to do subset in R
I have a data frame with 400,000 rows and I want to subset it to retrieve one or more rows.
As an example, here is a small data frame and the kind of subset I am running, using the subset function:

    main <- data.frame(
      date  = as.POSIXct(c("2015-01-01 07:44:00 GMT", "2015-02-02 09:46:00 GMT")),
      name  = c("bob", "george"),
      value = c(1, 522),
      id    = c(5, 2)
    )
    subset(main, date == "2015-01-01 07:44:00" & name == "bob" & value == 1)
It works, but it is slow and I think this is because I am working with a 400k row dataframe. Any ideas how to make the subset faster?
I would suggest using a keyed data.table. Here's how to set it up (for a modified example):
    require(data.table)
    mainDT <- data.table(main)
    setkey(mainDT, V1, V2, V3)
We can now subset based on equality conditions using syntax like mainDT[J(c("a","b"))], which subsets the rows where V1 %in% c("a","b").
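As a minimal runnable sketch of the keyed lookup (toy data with made-up values, assuming the data.table package is installed):

```r
library(data.table)

# Small toy table; V1 and V2 play the role of the key columns.
dt <- data.table(V1 = c("a", "a", "b", "c"),
                 V2 = c("A", "B", "A", "C"),
                 V3 = 1:4)
setkey(dt, V1, V2)       # sorts the table and marks V1, V2 as the key

dt[J("a", "A")]          # binary search: rows where V1 == "a" & V2 == "A"
dt[J(c("a", "b"))]       # join on the key prefix: rows where V1 %in% c("a","b")
```

Because the table is sorted on its key, J() lookups use binary search instead of scanning every row, which is where the speedup comes from.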
Here's a speed comparison:
    require(rbenchmark)
    benchmark(
      "["       = main[main$V1 == "a" & main$V2 == "A", ],
      "subset"  = subset(main, V1 == "a" & V2 == "A"),
      "DT[J()]" = mainDT[J("a", "A")],
      replications = 5
    )[, 1:6]
which gives these results on my computer:
         test replications elapsed relative user.self sys.self
    1       [            5    5.96       NA      5.38     0.57
    3 DT[J()]            5    0.00       NA      0.00     0.00
    2  subset            5    6.93       NA      6.20     0.72
So the keyed subset is essentially instant, while the other two methods take several seconds. However, the keyed subset comes with caveats:
- This is only for equality conditions.
- For the simple syntax above, you need to pass the arguments in key order. However, you can also filter with column conditions such as mainDT[V1 == "a" & V3 == 2], and that is still pretty fast.
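To illustrate that second caveat with a small sketch (toy data, assuming data.table is installed): ordinary column-based filtering works even when the condition skips one of the key columns, something the plain J() form can't express.

```r
library(data.table)

dt <- data.table(V1 = c("a", "a", "b"),
                 V2 = c("A", "B", "A"),
                 V3 = c(2, 5, 2))
setkey(dt, V1, V2, V3)

# Condition on V1 and V3 only, skipping the middle key column V2:
dt[V1 == "a" & V3 == 2]
```

This falls back on a column scan rather than the direct binary search, but it is still reasonably fast on a keyed table.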
Anything you can do with a data.frame can also be done with a data.table. For example, subset(mainDT, V1 == "a" & V2 == "A") still works, so generally nothing is lost in switching from data.frames to data.tables. You can convert a data frame to a data table with data.table(main), as above.
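As a quick sketch of that point (made-up toy data, assuming data.table is installed), subset() and data.table's own bracket syntax pick out the same row from the converted table:

```r
library(data.table)

main   <- data.frame(V1 = c("a", "b"), V2 = c("A", "B"))
mainDT <- data.table(main)   # convert; as.data.table(main) also works

# Both calls select the same single matching row:
subset(mainDT, V1 == "a" & V2 == "A")
mainDT[V1 == "a" & V2 == "A"]
```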
Here's the code that generates the example data:
    n  <- 1e7
    n3 <- 1e3
    set.seed(1)
    main <- data.frame(
      V1 = sample(letters, n, replace = TRUE),
      V2 = sample(c(letters, LETTERS), n, replace = TRUE),
      V3 = sample(1:n3, n, replace = TRUE),
      V4 = rnorm(n)
    )
The improvement you see will depend on your data. If you have more observations (larger n) or more unique values for the keys (larger n3), the advantage of the keyed data table subset should be greater.