How can I dynamically use lapply in data.table?
I have a dataset that looks like
set.seed(18)
library(data.table)
site1 <- data.table(id = 1:10, A = c(sample(c(NA, letters[1:10]), 10)),
                    B = sample(c(NA, LETTERS[1:7]), 10, replace = T),
                    C = sample(c(NA, 1:4), 10, replace = T))
site2 <- data.table(id = c(1:4, sample(5:15, 6)),
                    A = c(NA, NA, NA, sample(letters, 1), NA, NA, NA, sample(letters, 1), NA, NA),
                    B = sample(LETTERS, 10), d = sample(1:5, replace = T))
and a function that looks like
col.smash <- function(a, b, linkvars){
  require(data.table)
  ##### CONVERT TO DATA.TABLES FOR EASIER USE, AND MERGE
  if(dim(a)[1] <= dim(b)[1]){
    c <- data.table(a); setkeyv(c, linkvars)
    d <- data.table(b); setkeyv(d, linkvars)
  } else {
    c <- data.table(b); setkeyv(c, linkvars)
    d <- data.table(a); setkeyv(d, linkvars)
  }
  k <- c[d]
  rep.list <- names(a)[names(a) %in% names(b) & !(names(a) %in% linkvars)]
  i.combo <- paste0("i.", rep.list)
  f <- k[ , (rep.list) := lapply(.SD, function(x){ifelse(is.na(x),
                                                         get("i.", names(x)), x)}),
         .SDcols = rep.list]
  return(f)
}
The purpose of this function is to see which variables appear in both site1 and site2 and, where there is an NA in, say, site1$A, replace it with the corresponding value in site2$A. site1 takes precedence over site2, so the ifelse only needs to check one variable for NA.
I am getting an error inside the lapply because the first result of the ifelse after the condition, get("i.", names(x)), is not working as expected. When I run it, I get the following error:
Error in as.environment(pos) : using 'as.environment(NULL)' is defunct
which I don't understand. Ideally, I would get a data.table with all the values in site1 and site2, with variables A, B, C, d rather than i.A, i.B, for example:
id A B C d
1: 1 i E NA 4
2: 2 g F NA 4
3: 3 h NA 4 1
4: 4 x B 4 2
5: 5 j G NA NA
6: 6 c NA 3 4
7: 7 a D 2 NA
8: 8 b NA 2 NA
9: 9 d G 1 4
10: 10 f NA 1 NA
11: 12 NA V NA 2
12: 13 n J NA 1
13: 14 NA T NA 1
14: 15 NA X NA 1
So I think I really have two problems. The first is the error, and the second is that I am not getting all the rows in k in my function. They don't seem to be related.
Any help is appreciated.
Also, brownie points for anyone who gets the reference in col.smash.
The purpose of this function is to see what variables are in site1 and site2, and if there is "NA" in, say, site1$A, replace it with the corresponding value in site2$A. There is a hierarchy of site1 over site2.
The output can be obtained as
g <- function(d1, d2, byvars){
  D = funion(d1[, ..byvars], d2[, ..byvars])
  d2vars = setdiff(names(d2), byvars)
  D[d2, on=byvars, (d2vars) := mget(sprintf("i.%s", d2vars))]
  d1vars = setdiff(names(d1), byvars)
  D[d1, on=byvars, (d1vars) := mget(sprintf("i.%s", d1vars))]
  setcolorder(D, c(byvars, d1vars, setdiff(d2vars, d1vars)))
  setorderv(D, byvars)[]
}
g(site1, site2, "id")
which gives
id A B C d
1: 1 i E NA 4
2: 2 g F NA 4
3: 3 h NA 4 1
4: 4 NA B 4 2
5: 5 j G NA NA
6: 6 c NA 3 4
7: 7 a D 2 NA
8: 8 b NA 2 NA
9: 9 d G 1 4
10: 10 f NA 1 NA
11: 12 NA V NA 2
12: 13 n J NA 1
13: 14 NA T NA 1
14: 15 NA X NA 1
How it works
The argument byvars accepts a vector of column names.
The fairly new .. syntax allows selecting columns by a vector of names stored outside the data.table. I looked through the FAQ and ?data.table but couldn't find any documentation. For now, it is at least the first item in the list of changes for 1.10.2.
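As a minimal sketch of the .. prefix (the table DT and the vector cols are made-up names, not part of the question), the following selects columns whose names are stored in a variable outside the table. The with = FALSE form is the older way to get the same result:

```r
library(data.table)

# Hypothetical toy table and column vector, for illustration only
DT <- data.table(id = 1:3, x = letters[1:3], y = 4:6)
cols <- c("id", "y")

# ..cols tells data.table to look up `cols` in the calling scope
# instead of treating it as a column of DT
sel_new <- DT[, ..cols]

# Older equivalent that does not need the .. prefix
sel_old <- DT[, cols, with = FALSE]
```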
To enforce the "hierarchy of site1 over site2", we add the site2 values first and then the site1 values, so site1 gets the last word.
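One consequence of this ordering is that the second update copies every site1 value, NA included, so an NA in site1 replaces a non-missing value from site2 (compare id 4 in the output above with the desired output in the question). If the intent is to fall back to site2 wherever site1 is NA, one possible variant is sketched below. The function name g_fill and the toy tables t1 and t2 are hypothetical, and the sketch assumes that byvars identifies rows uniquely in each table and that shared columns are plain vectors of the same type (ifelse drops attributes, so factor or Date columns would need different handling).

```r
library(data.table)

# Hypothetical variant of g(): d1 wins, except where d1 is NA
g_fill <- function(d1, d2, byvars) {
  D <- funion(d1[, ..byvars], d2[, ..byvars])

  # Start from d2's values, as in g()
  d2vars <- setdiff(names(d2), byvars)
  D[d2, on = byvars, (d2vars) := mget(sprintf("i.%s", d2vars))]

  # d1's columns aligned to the rows of D (NA where d1 has no matching row)
  d1vars <- setdiff(names(d1), byvars)
  m <- d1[D[, ..byvars], on = byvars]

  for (v in d1vars) {
    from_d1 <- m[[v]]
    if (v %in% names(D)) {
      # Shared column: keep d2's value only where d1 is missing
      set(D, j = v, value = ifelse(is.na(from_d1), D[[v]], from_d1))
    } else {
      # Column found only in d1
      set(D, j = v, value = from_d1)
    }
  }

  setcolorder(D, c(byvars, d1vars, setdiff(d2vars, d1vars)))
  setorderv(D, byvars)[]
}

# Toy data, not the tables from the question
t1 <- data.table(id = 1:3, A = c("a", NA, "c"), C = c(1L, NA, 3L))
t2 <- data.table(id = 2:4, A = c("x", "y", "z"), d = 7:9)
res <- g_fill(t1, t2, "id")
```

In this toy case, id 2 has an NA for A in t1, so the value "x" from t2 is kept, while ids 1 and 3 keep the t1 values.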
Using funion assumes there are no duplicates within each table. If there are, a more sophisticated approach to this step would be required, perhaps something like
D = rbind(d1[, ..byvars], d2[,..byvars][!d1, on=byvars])
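To illustrate the difference on made-up data (the tables t1 and t2 are hypothetical and deliberately contain repeated ids), funion collapses duplicated keys, while the rbind with an anti-join keeps every key row of the first table and appends only those keys of the second table that are absent from the first:

```r
library(data.table)

# Hypothetical tables with repeated ids
t1 <- data.table(id = c(1L, 1L, 2L))
t2 <- data.table(id = c(2L, 3L, 3L))
byvars <- "id"

# funion drops duplicates: ids 1, 2, 3
u1 <- funion(t1[, ..byvars], t2[, ..byvars])

# rbind plus anti-join keeps repeats: ids 1, 1, 2, 3, 3
u2 <- rbind(t1[, ..byvars], t2[, ..byvars][!t1, on = byvars])
```

Whether repeated keys should survive this step depends on what the later update joins are expected to do with them, so this only shows how the two key tables differ.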