Filter rows by function by values of each row, data.table

Question

Switching from data.frame syntax to data.table syntax still isn't smooth for me. I thought the following should be trivial, but no. What am I doing wrong here:

> DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
> DT
   x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9

I need something like this:

cols = c("y", "v") # a vector of column names or indexes
DT[rowSums(cols) > 5] # Take only rows where the
# values in columns y and v satisfy a condition. 'rowSums' here is just an
# example; it can be any function that returns TRUE or FALSE when applied
# to the values of a row.

The following works, but what if I want to supply the column names dynamically, and my tables have many columns?

> DT[eval(quote(y + v > 5))] # the following command gives the same result
> DT[y + v > 5]
   x y v
1: a 6 3
2: b 3 5
3: b 6 6
4: c 1 7
5: c 3 8
6: c 6 9
> DT[lapply(.SD, sum) > 5, .SDcols = 2:3] # Want the same result as above
Empty data.table (0 rows) of 3 cols: x,y,v
> DT[lapply(.SD, sum) > 5, ,.SDcols = 2:3]
Empty data.table (0 rows) of 3 cols: x,y,v
> DT[lapply(.SD, sum) > 5, , .SDcols = c("y", "v")]
Empty data.table (0 rows) of 3 cols: x,y,v
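One thing worth noting about the attempts above: `lapply(.SD, sum)` aggregates each column, not each row, so it cannot yield one TRUE/FALSE per row. The sketch below contrasts it with `rowSums(.SD)` evaluated in `j`; the names `col_totals`, `row_totals` and `keep` are illustrative only, and `y` is written out with `rep()` to avoid relying on recycling.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), 3),
                 v = 1:9)

# Column-wise: one total per selected column (a one-row data.table).
col_totals <- DT[, lapply(.SD, sum), .SDcols = c("y", "v")]

# Row-wise: one total per row, so the comparison gives one TRUE/FALSE per row.
row_totals <- DT[, rowSums(.SD), .SDcols = c("y", "v")]
keep <- row_totals > 5

# Use the logical vector in i to pick the matching rows (all columns kept).
DT[keep]
```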

Update after answers: Since it turns out there are many ways to do this, I wanted to see which one performs best. Below is some sample timing code:

nr = 1e7
DT = data.table(x=sample(c("a","b","c"),nr, replace= T),
                y=sample(2:5, nr, replace = T), v=sample(1:9, nr, T))
threshold = 5
cols = c("y", "v")
col.ids = 2:3
filter.methods = 'DT[DT[, rowSums(.SD[, cols, with = F]) > threshold]]
DT[DT[, rowSums(.SD[, col.ids, with = F]) > threshold]]
DT[DT[, rowSums(.SD) > threshold, .SDcols = cols]]
DT[DT[, rowSums(.SD) > threshold, .SDcols = c("y", "v")]]
DT[DT[, rowSums(.SD) > threshold, .SDcols = col.ids]]
DT[ ,.SD[rowSums(.SD[, col.ids, with = F]) > threshold]]
DT[ ,.SD[rowSums(.SD[, cols, with = F]) > threshold]]
DT[, .SD[rowSums(.SD) > threshold], .SDcols = cols, by = x]
DT[, .SD[rowSums(.SD) > threshold], .SDcols = col.ids, by = x]
DT[, .SD[rowSums(.SD) > threshold], .SDcols = c("y", "v"), by = x]
DT[Reduce(`+`,eval(cols))>threshold]
DT[Reduce(`+`, mget(cols)) > threshold]
'
fm <- strsplit(filter.methods, "\n")
fm <- unlist(fm)
timing = data.frame()
rn = NULL
for (e in sample(fm, length(fm))) { 
  # Seen some weird pattern with first item in 'fm', so scramble it
  rn <- c(rn, e)
  if (e == "DT[Reduce(`+`,eval(cols))>threshold]") {
    cols = quote(list(y, v))
  } else {
    cols = c("y", "v")
  }
  tm <- system.time(eval(parse(text = e)))
  timing <- rbind(timing, 
                  data.frame(
                    as.list(tm[c("user.self", "sys.self", "elapsed")])
                    )
                  )
}
rownames(timing) <- rn
timing[order(timing$elapsed),]

### OUTPUT ####
#                                                                     user.self sys.self elapsed
# DT[Reduce(`+`,eval(cols))>threshold]                                   0.416    0.168   0.581
# DT[Reduce(`+`, mget(cols)) > threshold]                                0.412    0.172   0.582
# DT[DT[, rowSums(.SD) > threshold, .SDcols = cols]]                     0.572    0.316   0.889
# DT[DT[, rowSums(.SD) > threshold, .SDcols = col.ids]]                  0.568    0.320   0.889
# DT[DT[, rowSums(.SD) > threshold, .SDcols = c("y", "v")]]              0.576    0.316   0.890
# DT[ ,.SD[rowSums(.SD[, col.ids, with = F]) > threshold]]               0.648    0.404   1.052
# DT[DT[, rowSums(.SD[, cols, with = F]) > threshold]]                   0.688    0.368   1.052
# DT[DT[, rowSums(.SD[, col.ids, with = F]) > threshold]]                0.612    0.440   1.053
# DT[ ,.SD[rowSums(.SD[, cols, with = F]) > threshold]]                  0.692    0.368   1.058
# DT[, .SD[rowSums(.SD) > threshold], .SDcols = c("y", "v"), by = x]     0.800    0.448   1.248
# DT[, .SD[rowSums(.SD) > threshold], .SDcols = col.ids, by = x]         0.836    0.412   1.248
# DT[, .SD[rowSums(.SD) > threshold], .SDcols = cols, by = x]            0.836    0.416   1.249

Thus, the speed champions are:

DT[Reduce(`+`,eval(cols))>threshold]
DT[Reduce(`+`, mget(cols)) > threshold]

Of the two, I prefer the mget one. I think the others are slower because they call rowSums, whereas Reduce only helps to build the expression. Sincere thanks to everyone who answered. I find it hard to decide which answer to accept: the Reduce-based one is very specific to the sum operation, while the rowSums-based one is an example of using an arbitrary function.
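To illustrate the "arbitrary function" case, here is a minimal sketch of a wrapper around the logical-vector approach. `filter_rows` and its `pred` argument are hypothetical names introduced for this example, not part of data.table; `pred` is assumed to take the selected columns as a data.table and return one TRUE/FALSE per row.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), 3),
                 v = 1:9)

# Hypothetical helper: keep the rows of `dt` for which `pred`, applied to
# the columns named in `cols`, returns TRUE.
filter_rows <- function(dt, cols, pred) {
  keep <- dt[, pred(.SD), .SDcols = cols]  # logical vector, one entry per row
  dt[keep]
}

# Same condition as in the question: y + v > 5.
res_sum <- filter_rows(DT, c("y", "v"), function(d) rowSums(d) > 5)

# A different row-wise rule: the spread between the columns is at least 3.
res_spread <- filter_rows(DT, c("y", "v"),
                          function(d) apply(d, 1, function(r) max(r) - min(r) >= 3))
```

The `apply()` variant is general but evaluates the function once per row, so it should be expected to be slower than vectorised predicates such as `rowSums`.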

r data.table

biocyberman 12 Aug 14 at 11:18

3 answers

David Arenburg · Answer 1 · 2014-08-12T12:23:01+0000

cols = c("y", "v")

Try

DT[DT[, rowSums(.SD[, cols, with = F]) > 5]]

or

DT[DT[, rowSums(.SD[, 2:3, with = F]) > 5]]

or

DT[DT[, rowSums(.SD) > 5, .SDcols = cols]]

or

DT[DT[, rowSums(.SD) > 5, .SDcols = c("y", "v")]]

or

DT[DT[, rowSums(.SD) > 5, .SDcols = 2:3]]

or

DT[ ,.SD[rowSums(.SD[, 2:3, with = F]) > 5]]

or

DT[ ,.SD[rowSums(.SD[, cols, with = F]) > 5]]

or

DT[, .SD[rowSums(.SD) > 5], .SDcols = cols, by = x]

or

DT[, .SD[rowSums(.SD) > 5], .SDcols = 2:3, by = x]

or

DT[, .SD[rowSums(.SD) > 5], .SDcols = c("y", "v"), by = x]

Each result will be

#    x y v
# 1: a 6 3
# 2: b 3 5
# 3: b 6 6
# 4: c 1 7
# 5: c 3 8
# 6: c 6 9

Some explanations:

.SD is itself a data.table object that can be operated on within the scope of DT. So the line DT[ ,rowSums(.SD[, cols, with = F]) > 5] returns a logical vector indicating the rows of DT where y + v > 5. We then wrap it in another DT[...] to select those rows of DT.

When you use .SDcols, it restricts .SD to those columns only. So if you only do something like DT[, .SD[rowSums(.SD) > 5], .SDcols = 2:3], you will lose the column x; that is why by = x was added.

Another option when using .SDcols is to return a logical vector and then pass it to another DT[...].
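The effect of .SDcols on which columns survive can be checked directly. This is a small sketch on the question's data; the variable names `only_sd`, `with_by` and `with_mask` are illustrative.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), 3),
                 v = 1:9)

# Filtering .SD itself: the result contains only the .SDcols columns.
only_sd <- DT[, .SD[rowSums(.SD) > 5], .SDcols = c("y", "v")]
names(only_sd)

# Grouping by x brings x back as the grouping column.
with_by <- DT[, .SD[rowSums(.SD) > 5], .SDcols = c("y", "v"), by = x]
names(with_by)

# Building a logical vector first and using it in i keeps every column.
with_mask <- DT[DT[, rowSums(.SD) > 5, .SDcols = c("y", "v")]]
names(with_mask)
```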

Shambho · Answer 2 · 2014-08-13T01:56:07+0000

Here's another possibility:

cols <- quote(list(y, v))
DT[Reduce(`+`,eval(cols))>5]

Or, if you prefer to store cols as a character vector:

cols <- c('y', 'v')
DT[Reduce(`+`, mget(cols)) > 5]
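To see what Reduce contributes here, the sketch below builds the list of column vectors by hand (a stand-in for what mget(cols) supplies inside DT[...]) and folds it with two different operators. The names `col_list`, `row_total` and `all_above_2` are illustrative, and the second condition is an assumed example rather than something from the answers.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), 3),
                 v = 1:9)
cols <- c("y", "v")

# Plain list holding the selected column vectors.
col_list <- lapply(cols, function(nm) DT[[nm]])

# Reduce folds the list pairwise, so this is y + v computed element-wise.
row_total <- Reduce(`+`, col_list)

# Swapping the operator changes the row-wise rule, e.g. "every selected
# column is greater than 2" by folding per-column logical vectors with `&`.
all_above_2 <- Reduce(`&`, lapply(col_list, function(col) col > 2))

DT[row_total > 5]
DT[all_above_2]
```

Because each step is a vectorised operation on whole columns, no per-row function call is involved.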

akrun · Answer 3 · 2014-08-12T12:21:15+0000

One method:

cols <- quote(list(y, v))
DT[DT[,Reduce(`+`,eval(cols))>5]]
#    x y v
# 1: a 6 3
# 2: b 3 5
# 3: b 6 6
# 4: c 1 7
# 5: c 3 8
# 6: c 6 9
