How to aggregate in R with a custom function that uses two columns

Is it possible to aggregate with a custom function that uses two columns to return one column?

Let's say I have a dataframe:

x <- c(2,4,3,1,5,7)
y <- c(3,2,6,3,4,6)
group <- c("A","A","A","A","B","B")

data <- data.frame(group, x, y)
data
#   group x y
# 1     A 2 3
# 2     A 4 2
# 3     A 3 6
# 4     A 1 3
# 5     B 5 4
# 6     B 7 6

      

And I have my function that I want to use on two columns (x and y):

pathlength <- function(xy) {
  out <- as.matrix(dist(xy))
  sum(out[row(out) - col(out) == 1])
}
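For readers checking the function: `pathlength()` sums the Euclidean distances between consecutive rows, which are the sub-diagonal entries of the full distance matrix. A minimal sketch of an equivalent that avoids building the whole matrix follows; `pathlength_diff` is a hypothetical name and not part of the original post.

```r
# Original definition from the question: builds the full n x n distance matrix.
pathlength <- function(xy) {
  out <- as.matrix(dist(xy))
  sum(out[row(out) - col(out) == 1])
}

# Hypothetical alternative (an assumption, not from the original post):
# only the distances between consecutive rows are computed.
pathlength_diff <- function(xy) {
  m <- as.matrix(xy)
  if (nrow(m) < 2) return(0)        # a single point has zero path length
  steps <- diff(m)                  # row i+1 minus row i, column by column
  sum(sqrt(rowSums(steps^2)))       # Euclidean length of each step, summed
}

xy <- cbind(x = c(2, 4, 3, 1), y = c(3, 2, 6, 3))
same <- isTRUE(all.equal(pathlength(xy), pathlength_diff(xy)))
same
```

Both versions take a two-column matrix or data frame, so either can be passed to the grouping approaches in the answers.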

      

I tried the following with an aggregate:

out <- aggregate(cbind(x, y) ~ group, data, FUN = pathlength)  
out <- aggregate(cbind(x, y) ~ group, data, function(x) pathlength(x))  

      

However, this computes the path length along x and along y separately rather than together, giving me:

#  group x y
#1     A 5 8
#2     B 2 2

      

I want it to call pathlength by x and y together and aggregate it that way. Here's what I want to do for the aggregate:

realA <- matrix(c(2,4,3,1,3,2,6,3), nrow=4, ncol=2)
pathlength(realA)
# [1] 9.964725

realB <- matrix(c(5,7,4,6), nrow=2, ncol=2)
pathlength(realB)
# [1] 2.828427

# Built inline so the pathlength() function and the group vector are not overwritten
real_out <- data.frame(group = c("A", "B"),
                       pathlength = c(9.964725, 2.828427))
real_out
#   group pathlength
# 1     A   9.964725
# 2     B   2.828427

      

Does anyone have any suggestions? Or is there some other function, which I have not been able to find on Google, that would let me do this? I would rather not resort to a for loop, as I assume it would be slow on a large dataset.



3 answers


As you found, the base function aggregate() applies the function to one column at a time. You can use by() instead:

by(data[,c("x","y")], data$group, pathlength)
data$group: A
[1] 9.964725
----------------------------------------------------------------------- 
data$group: B
[1] 2.828427

      



or split()/lapply():

lapply(split(data[,c("x","y")], data$group), pathlength)
$A
[1] 9.964725

$B
[1] 2.828427
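
If a data frame like the `real_out` in the question is wanted, the per-group results can be collected afterwards. This is a minimal base-R sketch; `res` and `out` are illustrative names, and `sapply()` is swapped in for `lapply()` so the results simplify to a named vector.

```r
# Data and function as defined in the question.
x <- c(2, 4, 3, 1, 5, 7)
y <- c(3, 2, 6, 3, 4, 6)
group <- c("A", "A", "A", "A", "B", "B")
data <- data.frame(group, x, y)

pathlength <- function(xy) {
  out <- as.matrix(dist(xy))
  sum(out[row(out) - col(out) == 1])
}

# sapply() simplifies the per-group results to a named numeric vector.
res <- sapply(split(data[, c("x", "y")], data$group), pathlength)

# One row per group, with the group label taken from the vector names.
out <- data.frame(group = names(res), pathlength = unname(res))
out
```

The group labels come from the names that `split()` assigns, so no separate bookkeeping of the grouping column is needed.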

      



New answer

As @BrodieG pointed out, it's easy to do with "data.table":

library(data.table)

as.data.table(data)[, pathlength(.SD), by = group]
#    group       V1
# 1:     A 9.964725
# 2:     B 2.828427

      


Original (overkill) answer

You can create the matrix on the fly within "data.table":

library(data.table)

as.data.table(data)[, pathlength(matrix(unlist(.SD), ncol = length(.SD))), by = group]
#    group       V1
# 1:     A 9.964725
# 2:     B 2.828427

      

You could also create a helper function, such as the following, that builds the matrix for you:

sdmat <- function(sd) matrix(unlist(sd), ncol = length(sd))

      

Then you can do:



as.data.table(data)[, pathlength(sdmat(.SD)), by = group]
#    group       V1
# 1:     A 9.964725
# 2:     B 2.828427

      

Or even:

as.data.table(data)[, pathlength(sdmat(list(x, y))), by = group]
#    group       V1
# 1:     A 9.964725
# 2:     B 2.828427

      


Alternatively, you can try "dplyr":

library(dplyr)

data %>%
  group_by(group) %>%
  summarise(pathlength = pathlength(matrix(c(x, y), ncol = 2)))
# Source: local data frame [2 x 2]
# 
#   group pathlength
# 1     A   9.964725
# 2     B   2.828427

      


Alternatively, you can reshape the data to long format and then use your favorite aggregation function.

Here's an example, continuing with "dplyr" (plus "tidyr" for the reshaping):

library(dplyr)
library(tidyr)

data %>%
  gather(var, val, -group) %>%
  group_by(group) %>%
  summarise(pathlength = pathlength(matrix(val, ncol = length(unique(var)))))
# Source: local data frame [2 x 2]
# 
#   group pathlength
# 1     A   9.964725
# 2     B   2.828427

      



If anyone wants another simple solution, I ended up using ddply() from "plyr". Unlike aggregate(), ddply() lets you apply a function to multiple columns at once.

Here's the code:

library(plyr)

out <- ddply(data, "group", summarise,
             pathlength = pathlength(cbind(x, y)))

      
