A faster way to calculate the distance between all people during each time step

I have position data for several individuals, each registered at multiple time steps. I want to calculate the distance between each animal to all other animals registered in the same step.

Here's a simplified example with data about three individuals ("animal_id") registered on three dates ("date") each, at different positions ("x", "y"):

library(data.table)
dt1 <- data.table(animal_id = 1, date = as.POSIXct(c("2014-01-01", "2014-01-02", "2014-01-03")), 
                  x = runif(3, -10, 10), y = runif(3, -10, 10))
dt2 <- data.table(animal_id = 2, date = as.POSIXct(c("2014-01-01", "2014-01-02", "2014-01-03")), 
                  x = runif(3, -10, 10), y = runif(3, -10, 10))
dt3 <- data.table(animal_id = 3, date = as.POSIXct(c("2014-01-01", "2014-01-02", "2014-01-03")), 
                  x = runif(3, -10, 10), y = runif(3, -10, 10))
dt <- rbindlist(list(dt1, dt2, dt3))

# Create dist function between two animals at same time
dist.between.animals <- function(collar_id1, x1, y1, collar_id2, x2, y2) {
  if (collar_id1 == collar_id2) return(NA)
  sqrt((x1 - x2)^2 + (y1 - y2)^2)
}

# Get unique collar id of each animal, create column name for all animals per animal
animal_ids <- dt[ , unique(animal_id)]
animal_ids_str <- dt[,paste0("dist_to_", unique(animal_id))]
datetimes <- dt[ , unique(date)]

# Calculate distance of each animal to all animals, at same time
for (i in 1:length(animal_ids)) {
  for (j in 1:length(datetimes)) {
    x1 <- dt[.(animal_ids[i], datetimes[j]), x, on = .(animal_id, date)]
    y1 <- dt[.(animal_ids[i], datetimes[j]), y, on = .(animal_id, date)]
    dt[date == datetimes[j], animal_ids_str[i] := mapply(function(c, x2, y2) dist.between.animals(animal_ids[i], x1, y1, c, x2, y2), animal_id, x, y)]
  }
}

      

Here's an example of what the output should look like:

   animal_id       date          x          y dist_to_1  dist_to_2  dist_to_3
1:         1 2014-01-01 -7.0276047  4.7660664        NA  7.1354265 13.7962800
2:         1 2014-01-02 -6.6383802  7.0087919        NA  3.7003879 16.4294999
3:         1 2014-01-03 -0.9722872 -4.8638019        NA 11.6447645 11.8313410
4:         2 2014-01-01  0.1076661  4.8131960  7.135426         NA  7.7052205
5:         2 2014-01-02 -8.9042124  4.0832364  3.700388         NA 13.3225921
6:         2 2014-01-03  8.2858839  2.1992575 11.644764         NA  0.4569632
7:         3 2014-01-01  5.7519522 -0.4320359 13.796280  7.7052205         NA
8:         3 2014-01-02 -9.0805265 -9.2381889 16.429500 13.3225921         NA
9:         3 2014-01-03  8.6832729  1.9736531 11.831341  0.4569632         NA

      

However, my actual data contains about 30 animals with 20,000 observations per animal, so my current code is taking a long time. Is there a more efficient way to do this?

+3


source to share


3 answers


You can do a self-join by date ( dt[dt, on = "date",

), and for each match ( by = .EACHI

) calculate the distance:

dt[dt, on = "date",
   .(from_id = id, to_id = i.id, dist = sqrt((x - i.x)^2 + (y - i.y)^2)), by = .EACHI]

      


I want to pass data in wide format ( dcast

), bind that to the above code:

[ , dcast(.SD, from_id + date ~ to_id, value.var = "dist")]

      




Seems to be OK in test job using @digEmAll data

library(microbenchmark)
microbenchmark(
  digemall = dt[,(animal_ids_str):=distancesInSameDate(.SD,animal_ids_str),by=date],
  henrik =   dt[dt, on = "date",
                .(from_id = animal_id, to_id = i.animal_id, dist = sqrt((x - i.x)^2 + (y - i.y)^2)), by = .EACHI][
                  , dcast(.SD, from_id + date ~ to_id, value.var = "dist")],
  times = 5, unit = "relative")

# Unit: relative
#     expr      min       lq     mean   median       uq      max neval
# digemall 3.206063 2.058547 2.189487 2.035975 2.032324 2.019082     5
#   henrik 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000     5

      


Please note that I have not renamed "to_id" in my code. This mainly reflects my preference for keeping the data in a long format, and in that format I would like to have both "from_id" and "to_id" unprefixed. If you want to prefix the columns in wide format, just add to_id = paste0("dist_to_", i.animal_id)

in the first step.

+2


source


OK, so here's a bit of an unorthodox method, especially considering that this time around I think the data data is making things worse. I am using a function dist

that calculates Euclidean distance (or whatever, your selection). If you use diag=T, upper=T

, it generates a matrix which can then be assigned to the specified column-to-columns. Column creation can be tedious with multiple animals, but nothing that the function paste

can't fix.

dt[, c("dist_to_1", "dist_to_2", "dist_to_3") := NA]
dt<- arrange(dt, date, animal_id) # order by date. here it turns into a data.frame

for (i in 1:length(unique(dt$date))){
    sub<- subset(dt, dt$date == unique(dt$date)[i])
    dt[which(dt$date == unique(sub$date)), c("dist_to_1", "dist_to_2", "dist_to_3")]<- as.matrix(dist(sub[, c("x","y")], diag=T, upper=T))
}

dt[dt==0]<- NA #assign NAs for 0s. Not necessary if the it ok for diag==0.
setDT(dt) # back to datatable. Again this part is not really necessary.
dt<- dt[order(animal_id, date)] # order as initially ordered

      

Using this code:

> proc.time()-ptm
   user  system elapsed 
  0.051   0.007   0.068 

      



Using earlier code:

> proc.time()-ptm
   user  system elapsed 
  0.083   0.004   0.092 

      

If you find a way to use both dist

, data.table

you are golden, but I couldn't figure it out. This is pretty fast since it calls C, and it will be faster the more additions you add.

+3


source


Here's an alternative approach that should be much faster:

library(data.table)

### CREATE A BIG DATASET
set.seed(123)
nSamples <- 20000
nAnimals <- 30
allDates <- as.POSIXct(c("2014-01-01")) + (1:nSamples)*24*3600
dts <- lapply(1:nAnimals, function(id){
                            data.table(animal_id=id,date=allDates,
                                       x=runif(nSamples,-10,10), y=runif(nSamples,-10,10))
              })
dt <- rbindlist(dts)

### ALTERNATIVE APPROACH (NO LOOP)
animal_ids_str <- dt[,paste0("dist_to_",sort(unique(animal_id)))]
# set keys
setkey(dt,animal_id,date)
# add the distance columns
dt[,(animal_ids_str):=as.double(NA)]

# custom function that computes animal distances for a subset of dt at the same date
distancesInSameDate <- function(subsetDT,animal_ids_str){
  m <- as.matrix(dist(subsetDT[,.(x,y)]))
  diag(m) <- NA
  cols <- paste0("dist_to_",subsetDT$animal_id)
  missingCols <- animal_ids_str[is.na(match(animal_ids_str,cols))]
  m <- cbind(m,matrix(NA,nrow=nrow(m),ncol=length(missingCols)))
  colnames(m) <- c(cols,missingCols)
  DF <- as.data.frame(m,stringsAsFactors=F)
  DF <- DF[,match(animal_ids_str,colnames(DF))]
  return(DF)
}
# let compute the distances
system.time( dt[,(animal_ids_str):=distancesInSameDate(.SD,animal_ids_str),by=date] )

      

My machine requires:

   user  system elapsed 
  13.76    0.00   13.82 

      

+2


source







All Articles