R data.table with variable number of columns

Question

R data.table with variable number of columns

For each student in a dataset, a specific set of points can be collected. We want to calculate the average for each student, but using only the grades in the columns that were native to that student.

The columns required for the calculation are different for each row. I figured out how to write this in R using normal tools, but I am trying to rewrite with data.table, partly for fun, but also partly in anticipation of success on this small project, which could lead to having to do calculations for many, many lines.

Here is a small working example "select a specific column for each row problem".

set.seed(123234)
## Suppose these are 10 students in various grades
dat <- data.frame(id = 1:10, grade = rep(3:7, by = 2),
              A = sample(c(1:5, 9), 10,  replace = TRUE),
              B = sample(c(1:5, 9), 10, replace = TRUE),
              C = sample(c(1:5, 9), 10, replace = TRUE),
              D = sample(c(1:5, 9), 10, replace = TRUE))
## 9 is a marker for missing value, there might also be
## NAs in real data, and those are supposed to be regarded
## differently in some exercises

## Students in various grades are administered different
## tests.  A data structure gives the grade to test linkage.
## The letters are column names in dat
lookup <- list("3" = c("A", "B"),
           "4" = c("A", "C"),
           "5" = c("B", "C", "D"),
           "6" = c("A", "B", "C", "D"),
           "7" = c("C", "D"),
           "8" = c("C"))

## wrapper around that lookup because I kept getting confused
getLookup <- function(grade){
    lookup[[as.character(grade)]]
}


## Function that receives one row (named vector)
## from data frame and chooses columns and makes calculation
getMean <- function(arow, lookup){
    scores <- arow[getLookup(arow["grade"])]
    mean(scores[scores != 9], na.rm = TRUE)
}

stuscores <- apply(dat, 1, function(x) getMean(x, lookup))

result <- data.frame(dat, stuscores)
result

## If the data is 1000s of thousands of rows,
## I will wish I could use data.table to do that.

## Client will want students sorted by state, district, classroom,
## etc.

## However, am stumped on how to specify the adjustable
## column-name chooser

library(data.table)
DT <- data.table(dat)
## How to write call to getMean correctly?
## Want to do this for each participant (no grouping)
setkey(DT, id)

The desired outcome is the student average for the respective columns, for example:

> result
  id grade A B C D stuscores
1   1     3 9 9 1 4       NaN
2   2     4 5 4 1 5       3.0
3   3     5 1 3 5 9       4.0
4   4     6 5 2 4 5       4.0
5   5     7 9 1 1 3       2.0
6   6     3 3 3 4 3       3.0
7   7     4 9 2 9 2       NaN
8   8     5 3 9 2 9       2.0
9   9     6 2 3 2 5       3.0
10 10     7 3 2 4 1       2.5

What then? I have written many errors so far ...

I have not found examples in the data table examples where the columns to be used in the calculations for each row is itself a variable, thank you for your advice.

I have not asked anyone to write the code for me, I am asking for advice on how to get started with this problem.

+3

r data.table

pauljohn32 Jul 19 15 at 19:28

source to share

2 answers

Here's a different approach data.table

using melt.data.table

( data.table

1.9.5+ required ) and then connecting between data.table

s:

DT_m <- setkey(melt.data.table(DT, c("id", "grade"), value.name = "score"), grade, variable)
lookup_dt <- data.table(grade = rep(as.integer(names(lookup)), lengths(lookup)),
  variable = unlist(lookup), key = "grade,variable")
score_summary <- setkey(DT_m[lookup_dt, nomatch = 0L,
  .(res = mean(score[score != 9], na.rm = TRUE)), by = id], id)
setkey(DT, id)[score_summary, res := res]
#    id grade A B C D mean_score
# 1:  1     3 9 9 1 4        NaN
# 2:  2     4 5 4 1 5        3.0
# 3:  3     5 1 3 5 9        4.0
# 4:  4     6 5 2 4 5        4.0
# 5:  5     7 9 1 1 3        2.0
# 6:  6     3 3 3 4 3        3.0
# 7:  7     4 9 2 9 2        NaN
# 8:  8     5 3 9 2 9        2.0
# 9:  9     6 2 3 2 5        3.0
#10: 10     7 3 2 4 1        2.5

This is more verbose, but a little over twice as fast:

microbenchmark(da_method(), nk_method(), times = 1000)
#Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval
# da_method() 17.465893 17.845689 19.249615 18.079206 18.337346 181.76369  1000
# nk_method()  7.047405  7.282276  7.757005  7.489351  7.667614  20.30658  1000

+1

Nick kennedy 22 jul. 15 at 19:19

source to share

David Arenburg · Accepted Answer · 2015-07-19T21:58:34+0000

First of all, when making a reproducible example using functions like sample

(which sets a random seed every time it runs), you should use set.seed

.

Second, instead of iterating over each line, you can simply iterate over the list lookup

, which will always be less data (many times much less), and concatenate it with rowMeans

. You can also do this with base R, but you asked for a solution data.table

, so here's (for the purposes of this solution, I converted all 9 to NA

s, but you can try to generalize this to your specific case too)

So, using set.seed(123)

, your function gives

apply(dat, 1, function(x) getMean(x, lookup))
# [1] 2.000000 5.000000 4.666667 4.500000 2.500000 1.000000 4.000000 2.333333 2.500000 1.500000

And here's a possible application data.table

that only works on a list lookup

( for

loops in lists, very efficient in R btw, see here )

## convert all 9 values to NAs
is.na(dat) <- dat == 9L 
## convert your original data to `data.table`, 
## there is no need in additional copy of the data if the data is huge
setDT(dat)     
## loop only over the list
for(i in names(lookup)) {
  dat[grade == i, res := rowMeans(as.matrix(.SD[, lookup[[i]], with = FALSE]), na.rm = TRUE)]
}
dat
#     id grade  A  B  C  D      res
#  1:  1     3  2 NA NA NA 2.000000
#  2:  2     4  5  3  5 NA 5.000000
#  3:  3     5  3  5  4  5 4.666667
#  4:  4     6 NA  4 NA  5 4.500000
#  5:  5     7 NA  1  4  1 2.500000
#  6:  6     3  1 NA  5  3 1.000000
#  7:  7     4  4  2  4  5 4.000000
#  8:  8     5 NA  1  4  2 2.333333
#  9: NA     6  4  2  2  2 2.500000
# 10: 10     7  3 NA  1  2 1.500000

Perhaps this could be improved with help set

, but I can't think of a good way right now.

PS

As suggested @Arun, please look at the vignettes, which he has written here , to get acquainted with the operator :=

, .SD

, with = FALSE

etc.

R data.table with variable number of columns

More articles: