Fastest way to do subset in R

I have a data frame named main that contains 400,000 rows, and I want to subset it to retrieve one or more rows.

As an example, here is a data frame that shows the kind of subset I mean. I am using the subset function:


main <- data.frame(date = as.POSIXct(c("2015-01-01 07:44:00 GMT","2015-02-02 09:46:00 GMT")),
                   name= c("bob","george"),
                   id= c(5,2))

# main has no `value` column, so the third condition uses `id` instead
subset(main, date == "2015-01-01 07:44:00" & name == "bob" & id == 5)


It works, but it is slow and I think this is because I am working with a 400k row dataframe. Any ideas how to make the subset faster?


1 answer

I would suggest using a keyed data.table. Here's how to set it up (for a modified example):

library(data.table)
mainDT <- data.table(main, key = c("V1", "V2", "V3"))  # keyed on V1, V2, V3, in that order


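We can now subset based on equality conditions using syntax like mainDT[J("a","A")], which subsets where V1=="a" & V2=="A". Passing a vector such as c("a","b") for the first key column instead subsets where V1 %in% c("a","b") (equivalent to V1=="a"|V1=="b").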
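As a minimal sketch of these keyed lookups, the following uses a small made-up table. The name smallDT and all of its values are assumptions for illustration, not the data used in the benchmark.

```r
library(data.table)

# Toy data; column names follow the answer, values are made up for illustration
smallDT <- data.table(
  V1 = c("a", "a", "b", "c"),
  V2 = c("A", "B", "A", "A"),
  V3 = c(1L, 2L, 2L, 3L)
)
setkey(smallDT, V1, V2, V3)  # sorts by V1, V2, V3 and marks them as the key

# Rows where V1 == "a" & V2 == "A"
one_match <- smallDT[J("a", "A")]

# Rows where V1 %in% c("a", "b")
two_values <- smallDT[J(c("a", "b"))]

print(one_match)
print(two_values)
```

The values inside J() are matched against the key columns from left to right, which is why only the leading key columns can be given this way.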


Here's a speed comparison:

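library(rbenchmark)  # benchmark() comes from the rbenchmark package
benchmark(
  "["       = main[main$V1=="a" & main$V2=="A",],
  "subset"  = subset(main,V1=="a" & V2=="A"),
  "DT[J()]" = mainDT[J("a","A")],
  replications = 5)[, 1:6]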


which gives these results on my computer:

     test replications elapsed relative user.self sys.self
1       [            5    5.96       NA      5.38     0.57
3 DT[J()]            5    0.00       NA      0.00     0.00
2  subset            5    6.93       NA      6.20     0.72


So subsetting with J is effectively instant, while the other two methods take six to seven seconds over the five replications. However, subsetting with J has some limitations:

  • It only works for equality conditions.
  • For the simple syntax above, you need to pass the arguments in key order. However, you can still select where V1=="a" & V3==2 using mainDT[J("a",unique(V2),2)], and it is still pretty fast.
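Here is a small sketch of the second point, skipping the middle key column by filling it with all of its values. The table smallDT and its contents are made up for illustration, and nomatch = 0L is my addition so that combinations absent from the table are dropped rather than returned as NA rows.

```r
library(data.table)

# Toy data; values are made up for illustration
smallDT <- data.table(
  V1 = c("a", "a", "a", "b"),
  V2 = c("A", "B", "C", "A"),
  V3 = c(2L, 1L, 2L, 2L),
  key = c("V1", "V2", "V3")
)

# Fill the skipped key column with all of its values; nomatch = 0L drops
# combinations that do not occur in the table
via_key <- smallDT[J("a", unique(V2), 2L), nomatch = 0L]

# Same rows via an ordinary logical condition, which also handles inequalities
via_scan <- smallDT[V1 == "a" & V3 == 2L]

print(via_key)
```

For conditions that are not equalities (for example V3 > 2), the ordinary logical form in the last statement is the fallback.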

Anything you can do with a data.frame can also be done with a data.table. For example, subset(mainDT,V1=="a" & V2=="A") still works. So, in general, nothing is lost by switching from data.frames to data.tables. You can convert an existing data frame in place with setDT(main).
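A short sketch of that conversion, using a made-up data frame (the name main_small and its values are assumptions for illustration):

```r
library(data.table)

# Toy data frame; values are made up for illustration
main_small <- data.frame(
  V1 = c("a", "a", "b"),
  V2 = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

setDT(main_small)        # converts in place: no copy and no reassignment
print(class(main_small)) # a data.table is still a data.frame

# Base-style subsetting still works on the converted object
res <- subset(main_small, V1 == "a" & V2 == "A")
print(res)
```

Because setDT() modifies the object by reference, it avoids copying a large data frame, which matters at 400,000 rows or more.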


Here's the code for an example:

n  = 1e7
n3 = 1e3

main <- data.frame(


The improvement shown above will vary with your data. If you have many observations (n) or many unique values for the keys (for example, n3), the advantage of subsetting with a keyed data.table should be greater.
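The data-generating code above is cut off, so the original column definitions are not known. As a rough sketch under assumed definitions (random letters for V1 and V2, integers up to n3 for V3, and a much smaller n so that it runs quickly), a comparison along the same lines might look like this:

```r
library(data.table)

set.seed(1)
n  <- 1e5   # far fewer rows than the answer's 1e7, to keep the sketch quick
n3 <- 1e3

# Assumed column definitions; the original ones are not shown in full
main <- data.frame(
  V1 = sample(letters, n, replace = TRUE),
  V2 = sample(LETTERS, n, replace = TRUE),
  V3 = sample.int(n3, n, replace = TRUE),
  stringsAsFactors = FALSE
)
mainDT <- data.table(main, key = c("V1", "V2", "V3"))

# Time a plain logical scan against a keyed lookup
t_scan <- system.time(r_scan <- main[main$V1 == "a" & main$V2 == "A", ])
t_key  <- system.time(r_key  <- mainDT[J("a", "A")])

print(t_scan["elapsed"])
print(t_key["elapsed"])

# Both approaches should select the same number of rows
print(nrow(r_scan) == nrow(r_key))
```

At this reduced size both timings will be small; raising n toward the answer's value is what should make the gap visible.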


