Averaging scores from one table over the ranges in another in R, leaving NA where no scores exist
I have two tables. The first (T1) holds ranges of numbers, each given by a start and an end. The second (T2) holds coordinates that fall inside those ranges, each with a score.
I want to calculate the average score in T2 for each range in T1 and add it to T1, with NA where no coordinate falls inside the range. For example:
Table 1 (T1):
start end
1000 1100
1300 1390
1530 1610
1800 1905
Table 2 (T2):
coordinate score
1002 3
1004 1
1020 5
1087 4
1550 1
1559 7
1609 3
1805 2.5
Result: the scores in T2 are averaged within each range of T1. For example, 1000 to 1100 gives (3+1+5+4)/4 = 3.25, while 1300 to 1390 contains no score and so gets NA, and so on.
start end mean-score
1000 1100 3.25
1300 1390 NA
1530 1610 3.67
1800 1905 2.5
Can you help me implement this in R?
Thanks.
Prompted by @akrun, I came across the foverlaps function in the data.table package. I'm not sure if this is the best way to do it (but it works :-))
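library(data.table)
T1 <- as.data.table(T1)
T2 <- as.data.table(T2)
# foverlaps needs the second table keyed on its interval columns
setkey(T1, start, end)
# Treat each coordinate as a zero-width interval [coordinate, coordinate]
T2[, c("start", "end") := coordinate]
# One row per T2 point with its matching T1 range, then average per range
foverlaps(T2, T1)[, list(score = mean(score)), by = list(start, end)]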
# start end score
# 1: 1000 1100 3.250000
# 2: 1530 1610 3.666667
# 3: 1800 1905 2.500000
Update:
As @Arun mentioned in the comments, if you also set a key on T2 and swap the order of the arguments to foverlaps, you also get the NA row.
setkey(T2, start, end)
foverlaps(T1, T2)[, list(mean = mean(score)), by = list(i.start, i.end)]
# i.start i.end mean
# 1: 1000 1100 3.250000
# 2: 1300 1390 NA
# 3: 1530 1610 3.666667
# 4: 1800 1905 2.500000
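The first call drops the 1300 to 1390 row because it produces one row per T2 point, so a range without points never forms a group. Below is a minimal base R sketch of that idea, without data.table and not taken from the answer itself. The names idx, agg and res are illustrative, and the sketch assumes the ranges in T1 do not overlap.

```r
T1 <- data.frame(start = c(1000, 1300, 1530, 1800),
                 end   = c(1100, 1390, 1610, 1905))
T2 <- data.frame(coordinate = c(1002, 1004, 1020, 1087, 1550, 1559, 1609, 1805),
                 score      = c(3, 1, 5, 4, 1, 7, 3, 2.5))

# For each T2 point, find the row of T1 whose [start, end] contains it
# (assumption: ranges in T1 do not overlap, so the first hit is the only hit).
idx <- vapply(T2$coordinate, function(x) {
  hit <- which(T1$start <= x & x <= T1$end)
  if (length(hit) > 0) hit[1] else NA_integer_
}, integer(1))

# Grouping from the T2 side: only ranges that contain points appear.
agg <- tapply(T2$score, idx, mean)

# Matching back from the T1 side: ranges without points become NA.
res <- T1
res$mean_score <- as.vector(agg)[match(seq_len(nrow(T1)), as.integer(names(agg)))]
res
```

Here agg has three entries, one per non-empty range, and the match() step is what restores the fourth row with NA, much as swapping the arguments to foverlaps does.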
One approach, using sapply:
T1$mean_score <- sapply(seq_len(nrow(T1)), function(i) {
  x1 <- T1[i, ]
  mean(T2$score[T2$coordinate > x1[, 1] & T2$coordinate <= x1[, 2]])
})
T1
# start end mean_score
#1 1000 1100 3.250000
#2 1300 1390 NaN
#3 1530 1610 3.666667
#4 1800 1905 2.500000
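The second row is NaN rather than the NA the question asks for, because mean() of an empty vector returns NaN. Here is a small sketch of one way to get NA instead, keeping the same range test as above. mean_or_na is a hypothetical helper name, not a base R function.

```r
T1 <- data.frame(start = c(1000, 1300, 1530, 1800),
                 end   = c(1100, 1390, 1610, 1905))
T2 <- data.frame(coordinate = c(1002, 1004, 1020, 1087, 1550, 1559, 1609, 1805),
                 score      = c(3, 1, 5, 4, 1, 7, 3, 2.5))

# Hypothetical helper: the mean of x, or NA when x is empty.
mean_or_na <- function(x) if (length(x) == 0) NA_real_ else mean(x)

T1$mean_score <- vapply(seq_len(nrow(T1)), function(i) {
  # Same test as the sapply() version: start < coordinate <= end
  in_range <- T2$coordinate > T1$start[i] & T2$coordinate <= T1$end[i]
  mean_or_na(T2$score[in_range])
}, numeric(1))
T1
```

An alternative that leaves the original code untouched is to patch the column afterwards with T1$mean_score[is.nan(T1$mean_score)] <- NA.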
Data:
T1 <- structure(list(start = c(1000L, 1300L, 1530L, 1800L), end = c(1100L,
1390L, 1610L, 1905L)), .Names = c("start", "end"), class = "data.frame", row.names = c(NA,
-4L))
T2 <- structure(list(coordinate = c(1002L, 1004L, 1020L, 1087L, 1550L,
1559L, 1609L, 1805L), score = c(3, 1, 5, 4, 1, 7, 3, 2.5)), .Names = c("coordinate",
"score"), class = "data.frame", row.names = c(NA, -8L))
Another option is to use the dplyr functions rowwise, do and between.
library(dplyr)
T1 %>%
  rowwise() %>%
  do(data.frame(., mean_score = mean(T2$score[between(T2$coordinate, left = .$start, right = .$end)])))
# start end mean_score
# 1 1000 1100 3.250000
# 2 1300 1390 NaN
# 3 1530 1610 3.666667
# 4 1800 1905 2.500000
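One detail to watch: between() includes both endpoints (it computes x >= left & x <= right), whereas the sapply() answer excludes the start of each range. With the data in the question the results agree, because no coordinate sits exactly on a start value. The base R sketch below uses a made-up point at exactly 1000 (hypothetical, not from the question) to show where the two tests differ.

```r
# Hypothetical data: the first point sits exactly on the start of the range.
coordinate <- c(1000, 1002)
score      <- c(10, 3)
start <- 1000
end   <- 1100

# Inclusive on both ends, which is what dplyr::between() computes.
inclusive <- coordinate >= start & coordinate <= end
# Exclusive at the start, as in the sapply() answer.
half_open <- coordinate > start & coordinate <= end

mean(score[inclusive])  # 6.5
mean(score[half_open])  # 3
```

Which convention is right depends on whether a coordinate equal to start should count as inside the range, which the question does not specify.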