How can I dynamically use lapply in data.table?

I have two datasets that look like this:

set.seed(18)
library(data.table)
site1 <- data.table(id = 1:10, A = c(sample(c(NA, letters[1:10]),10)), 
                    B = sample(c(NA, LETTERS[1:7]), 10, replace = T),
                    C = sample(c(NA, 1:4), 10, replace = T))

site2 <- data.table(id = c(1:4, sample(5:15, 6)), 
                    A = c(NA, NA, NA, sample(letters, 1), NA, NA, NA, sample(letters, 1), NA, NA), 
                    B = sample(LETTERS, 10), d = sample(1:5, replace = T))

      

and a function that looks like

col.smash <- function(a, b, linkvars){
  require(data.table)
  
  ##### CONVERT TO DATA.TABLES FOR EASIER USE, AND MERGE
  if(dim(a)[1] <= dim(b)[1]){
    c <- data.table(a); setkeyv(c, linkvars)
    d <- data.table(b); setkeyv(d, linkvars)
  } else {
    c <- data.table(b); setkeyv(c, linkvars)
    d <- data.table(a); setkeyv(d, linkvars)
  }
 
  k <- c[d]
  
  rep.list<- names(a)[names(a) %in% names(b) & !(names(a) %in% linkvars)]
  i.combo <- paste0("i.",rep.list)

  f <- k[ , (rep.list) := lapply(.SD, function(x){ifelse(is.na(x), 
                                                   get("i.", names(x)), x)}), 
          .SDcols = rep.list]
  return(f)
  }

      

The purpose of this function is to find the variables that are in both site1 and site2 and, where there is an NA in, say, site1$A, replace it with the corresponding value in site2$A. site1 takes precedence over site2, so the ifelse call only needs to check one variable for NA.

I am getting an error inside lapply because the value ifelse should return when the condition is true, get("i.", names(x)), is not working as expected. When I run the function, I get the following error:

Error in as.environment(pos) : using 'as.environment(NULL)' is defunct

      

which I don't understand. Ideally, I would get a data.table with all the values in site1 and site2, with variables A, B, C, d rather than i.A, i.B. For example:

    id  A  B  C  d
 1:  1  i  E NA  4
 2:  2  g  F NA  4
 3:  3  h NA  4  1
 4:  4  x  B  4  2
 5:  5  j  G NA NA
 6:  6  c NA  3  4
 7:  7  a  D  2 NA
 8:  8  b NA  2 NA
 9:  9  d  G  1  4
10: 10  f NA  1 NA
11: 12 NA  V NA  2
12: 13  n  J NA  1
13: 14 NA  T NA  1
14: 15 NA  X NA  1

      

So I think I really have two problems. The first is the error, and the second is that I am not getting all the rows in k in my function. They don't seem to be related.

Any help is appreciated.

Also, brownie points for anyone who gets the col.smash reference.


1 answer


"The purpose of this function is to see what variables are in site1 and site2, and if there is "NA" in, say site1$A, replace it with the corresponding value in site2$A. There is a hierarchy site1 over site2"

The desired output can be obtained with:

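g <- function(d1, d2, byvars){
  # skeleton: every combination of the key columns found in either table
  D = funion(d1[, ..byvars], d2[, ..byvars])

  # add d2's columns first (lower priority)
  d2vars = setdiff(names(d2), byvars)
  D[d2, on=byvars, (d2vars) := mget(sprintf("i.%s", d2vars))]

  # then d1's columns, overwriting any shared columns (higher priority)
  d1vars = setdiff(names(d1), byvars)
  D[d1, on=byvars, (d1vars) := mget(sprintf("i.%s", d1vars))]

  # key columns first, then d1's columns, then columns found only in d2
  setcolorder(D, c(byvars, d1vars, setdiff(d2vars, d1vars)))
  setorderv(D, byvars)[]
}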

g(site1, site2, "id")

      

which gives

    id  A  B  C  d
 1:  1  i  E NA  4
 2:  2  g  F NA  4
 3:  3  h NA  4  1
 4:  4 NA  B  4  2
 5:  5  j  G NA NA
 6:  6  c NA  3  4
 7:  7  a  D  2 NA
 8:  8  b NA  2 NA
 9:  9  d  G  1  4
10: 10  f NA  1 NA
11: 12 NA  V NA  2
12: 13  n  J NA  1
13: 14 NA  T NA  1
14: 15 NA  X NA  1

      




How it works

The argument byvars accepts a vector of column names.

The fairly new .. prefix lets j refer to a vector of column names stored in a variable outside the data.table. I looked through the FAQ and ?data.table but couldn't find any documentation for it. For now, it is at least the first item in the list of changes for version 1.10.2.
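As a rough illustration of the .. prefix, here is a small sketch. The table DT and the variable cols are made up for this example, and the two alternatives shown are, as far as I know, the older ways of doing the same selection.

```r
library(data.table)

# Hypothetical example data, not taken from the question
DT <- data.table(id = 1:3, A = c("a", "b", "c"), B = c(10, 20, 30))
cols <- c("id", "B")   # column names held in an ordinary variable

sub1 <- DT[, ..cols]               # ".." says: look up `cols` outside DT
sub2 <- DT[, cols, with = FALSE]   # older equivalent
sub3 <- DT[, .SD, .SDcols = cols]  # another equivalent via .SD

names(sub1)
```

All three calls should return a data.table containing only the id and B columns.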

To give site1 precedence over site2, we add site2's columns first and then site1's, so site1 makes the last change.
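To see why the order matters, here is a small sketch with two made-up tables (x1 and x2 are hypothetical stand-ins for site1 and site2). Each update join overwrites the matched rows, so the table joined last wins, and that includes its NA values. If an NA in the higher-priority table should instead fall back to the other table's value, as the question describes, one option is fcoalesce(), which I believe is available in data.table 1.12.4 and later. This is a sketch of the idea, not a drop-in replacement for g.

```r
library(data.table)

# Hypothetical example data
x1 <- data.table(id = 1:3, A = c("a", NA, "c"))    # higher priority
x2 <- data.table(id = 2:4, A = c("x", "y", "z"))   # lower priority

# Plain update joins: the last join wins, NA included
D <- funion(x1[, "id"], x2[, "id"])
D[x2, on = "id", A := i.A]
D[x1, on = "id", A := i.A]
setorderv(D, "id")

# Variant: keep the existing value wherever x1 has NA
D2 <- funion(x1[, "id"], x2[, "id"])
D2[x2, on = "id", A := i.A]
D2[x1, on = "id", A := fcoalesce(i.A, A)]  # i.A is x1's A, A is D2's current A
setorderv(D2, "id")

D$A    # id 2 is NA because x1's NA overwrote "x"
D2$A   # id 2 keeps "x" from x2
```

The first version matches what g does, which is why row 4 of the output above has NA in column A. The second version is closer to the NA-replacement behaviour the question asks for.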

Using funion assumes there are no duplicates within each table. If there are, a more sophisticated approach to this step would be required, perhaps something like

D = rbind(d1[, ..byvars], d2[,..byvars][!d1, on=byvars])
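A small sketch of how that alternative behaves, using made-up tables in which d1 has a repeated id: d2[!d1, on = byvars] is an anti-join that keeps only the rows of d2 whose key does not appear in d1, so rbind stacks all of d1's keys (duplicates included) on top of the keys unique to d2.

```r
library(data.table)

# Hypothetical example data; id 1 appears twice in d1
d1 <- data.table(id = c(1, 1, 2), v = c("a", "b", "c"))
d2 <- data.table(id = c(2, 3),    w = c("x", "y"))
byvars <- "id"

# Anti-join: keys present in d2 but absent from d1
only_d2 <- d2[, ..byvars][!d1, on = byvars]

# Stack d1's keys (duplicates kept) and the keys unique to d2
D <- rbind(d1[, ..byvars], only_d2)

# For comparison, funion collapses the repeated key
U <- funion(d1[, ..byvars], d2[, ..byvars])

D$id
U$id
```

Here D keeps both rows for id 1, while U has one row per distinct id.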

      
