Comparing elements of two tables, averaging existing elements and leaving NA for nonexistent in R

I have two tables. The first (T1) holds ranges of numbers, given as start and end. The second (T2) holds coordinates, each with a score; the coordinates fall inside some of the ranges in T1.

I want to calculate the average score in T2 for each range in T1 and add it to T1 as a new column, with NA where no coordinate falls inside the range. For example:

table 1: (T1)

    start    end    
    1000    1100
    1300    1390
    1530    1610
    1800    1905

      

table 2: (T2)

coordinate  score
1002         3
1004         1
1020         5
1087         4
1550         1
1559         7
1609         3
1805        2.5

      

Result: average the T2 scores whose coordinate falls within each T1 range. For example, for 1000 to 1100 the mean is (3+1+5+4)/4 = 3.25. There is no score between 1300 and 1390, so that row gets NA, and so on.

start    end  mean-score  
1000    1100   3.25
1300    1390   NA
1530    1610   3.67
1800    1905   2.5
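
The expected values can be checked by hand from the T2 scores that fall in each range. This is only a sanity check of the arithmetic, with the grouping done manually; the variable names are made up for illustration.

```r
# Scores from T2, grouped manually by the T1 range their coordinate falls in
scores_1000_1100 <- c(3, 1, 5, 4)  # coordinates 1002, 1004, 1020, 1087
scores_1300_1390 <- numeric(0)     # no coordinate falls in this range
scores_1530_1610 <- c(1, 7, 3)     # coordinates 1550, 1559, 1609
scores_1800_1905 <- 2.5            # coordinate 1805

mean(scores_1000_1100)  # 3.25
mean(scores_1530_1610)  # 3.666667
mean(scores_1800_1905)  # 2.5
mean(scores_1300_1390)  # NaN: the mean of an empty vector is NaN, not NA
```

Because `mean()` of an empty vector is NaN, a solution has to handle the empty range explicitly if NA is wanted.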

      

Can you help me implement this in R?

Thanks.



3 answers


Prompted by @akrun, I came across the function foverlaps in the data.table package. I'm not sure if this is the best way to do it (but it works :-))

library(data.table)
T1 <- as.data.table(T1)
T2 <- as.data.table(T2)
setkey(T1, start, end)
T2[, c("start", "end") := coordinate]
foverlaps(T2, T1)[, list(score = mean(score)), by = list(start, end)]
#    start  end    score
# 1:  1000 1100 3.250000
# 2:  1530 1610 3.666667
# 3:  1800 1905 2.500000

      




Update:

As @Arun mentioned in the comments, if you also set a key on T2 and swap the order of the arguments to foverlaps, you also get NA for ranges with no matching coordinate.

setkey(T2, start, end)
foverlaps(T1, T2)[, list(mean = mean(score)), by = list(i.start, i.end)]
#    i.start i.end     mean
# 1:    1000  1100 3.250000
# 2:    1300  1390       NA
# 3:    1530  1610 3.666667
# 4:    1800  1905 2.500000

      



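One option is a base R loop with sapply. Note that the empty range comes out as NaN rather than NA, because mean() of an empty vector is NaN: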

T1$mean_score <- sapply(seq_len(nrow(T1)), function(i) {
  x1 <- T1[i, ]
  mean(T2$score[T2$coordinate > x1[, 1] & T2$coordinate <= x1[, 2]])
})

T1
#   start  end mean_score
# 1  1000 1100   3.250000
# 2  1300 1390        NaN
# 3  1530 1610   3.666667
# 4  1800 1905   2.500000
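
Since the question asks for NA rather than NaN, one way to adapt the loop above is to return NA explicitly when a range is empty. This is a minimal sketch and not part of the original answer: `mean_or_na` is a hypothetical helper name, and the sketch assumes both ends of each range are inclusive (the loop above excludes the start value).

```r
# Same data as in the question, built inline so the sketch is self-contained
T1 <- data.frame(start = c(1000L, 1300L, 1530L, 1800L),
                 end   = c(1100L, 1390L, 1610L, 1905L))
T2 <- data.frame(coordinate = c(1002L, 1004L, 1020L, 1087L,
                                1550L, 1559L, 1609L, 1805L),
                 score = c(3, 1, 5, 4, 1, 7, 3, 2.5))

# Hypothetical helper: mean of x, or NA when x is empty
mean_or_na <- function(x) if (length(x) == 0L) NA_real_ else mean(x)

# vapply enforces that every range yields exactly one numeric value
T1$mean_score <- vapply(seq_len(nrow(T1)), function(i) {
  # Assumption: start and end are both inclusive
  in_range <- T2$coordinate >= T1$start[i] & T2$coordinate <= T1$end[i]
  mean_or_na(T2$score[in_range])
}, numeric(1))

T1
```

An alternative is to keep the original loop and convert afterwards with `T1$mean_score[is.nan(T1$mean_score)] <- NA`.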

      



Data:

T1 <- structure(list(start = c(1000L, 1300L, 1530L, 1800L), end = c(1100L, 
 1390L, 1610L, 1905L)), .Names = c("start", "end"), class = "data.frame", row.names = c(NA, 
 -4L))


T2 <-  structure(list(coordinate = c(1002L, 1004L, 1020L, 1087L, 1550L, 
 1559L, 1609L, 1805L), score = c(3, 1, 5, 4, 1, 7, 3, 2.5)), .Names = c("coordinate", 
 "score"), class = "data.frame", row.names = c(NA, -8L))

      



Another possibility is to use the dplyr functions rowwise, do and between. As with sapply, the empty range gives NaN rather than NA.

library(dplyr)

T1 %>%
  rowwise() %>%
  do(data.frame(., mean_score = mean(T2$score[between(T2$coordinate, left = .$start, right = .$end)])))
#   start  end mean_score
# 1  1000 1100   3.250000
# 2  1300 1390        NaN
# 3  1530 1610   3.666667
# 4  1800 1905   2.500000

      
