Conditional grouping and summing of data frames in [R]
I have a data frame like this:
df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))
| ID | time | intensity |
|:---|-----:|----------:|
| A | 3.1 | 10 |
| A | 3.2 | 20 |
| B | 6.5 | 30 |
| B | 12.3 | 40 |
| C | 3.2 | 50 |
| C | 3.4 | 60 |
I would like to aggregate values (sum the intensities) by ID, but only when the time difference within an ID is small, e.g. no more than 0.3. I first calculated the time difference:
library(dplyr)
df.2 <- df %>%
  group_by(ID) %>%
  mutate(time.diff = max(time) - min(time))
... as a result:
| ID | time | intensity | time.diff |
|:---|-----:|----------:|----------:|
| A | 3.1 | 10 | 0.1 |
| A | 3.2 | 20 | 0.1 |
| B | 6.5 | 30 | 5.8 |
| B | 12.3 | 40 | 5.8 |
| C | 3.2 | 50 | 0.2 |
| C | 3.4 | 60 | 0.2 |
To be clear, I would like to get the result:
| ID | time | intensity | time.diff |
|:---|-----:|----------:|----------:|
| A | 3.15 | 30 | 0.1 |
| B | 6.5 | 30 | 5.8 |
| B | 12.3 | 40 | 5.8 |
| C | 3.3 | 110 | 0.2 |
where time is now the mean of the merged observations and intensity is their sum. ID "B" keeps both of its observations because its time difference is greater than 0.3. I tried with dplyr, but summarising always drops one of the "B" observations, and I want to keep them. I don't know how to do a conditional _group_by_.
Thanks for any ideas!
Possible option with data.table
library(data.table)
unique(setDT(df)[, time.diff := max(time) - min(time), ID][
    time.diff <= 0.3,
    c('time', 'intensity') := list(mean(time), sum(intensity)),
    ID])
# ID time intensity time.diff
#1: A 3.15 30 0.1
#2: B 6.50 30 5.8
#3: B 12.30 40 5.8
#4: C 3.30 110 0.2
Or using dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(time.diff = max(time) - min(time),
         indx = all(time.diff <= 0.3),
         intensity = ifelse(indx, sum(intensity), intensity),
         time = ifelse(indx, mean(time), time)) %>%
  filter(!indx | row_number() == 1) %>%
  select(-indx)
# ID time intensity time.diff
#1 A 3.15 30 0.1
#2 B 6.50 30 5.8
#3 B 12.30 40 5.8
#4 C 3.30 110 0.2
Another variation of the data.table solution:
setDT(df)[, time.diff := max(time) - min(time), by = ID
          ][, if (time.diff <= 0.3)
                .(time = mean(time), intensity = sum(intensity))
              else .SD, by = .(ID, time.diff)]
# ID time.diff time intensity
# 1: A 0.1 3.15 30
# 2: B 5.8 6.50 30
# 3: B 5.8 12.30 40
# 4: C 0.2 3.30 110
# get time.diff
df$time.diff <- ave(x = df$time,df$ID,FUN = function(x){max(x)-min(x)})
# new split variable to use with ID
df$cut <- cumsum(df$time.diff > .3)
# aggregate everything you need and ignore the cut variable
library(plyr)
ddply(df, c('cut', 'ID'), summarize,
      time = mean(time),
      intensity = sum(intensity),
      time.diff = mean(time.diff))[2:5]
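The same split-and-combine idea can also be sketched in base R alone, without plyr and without depending on the order of the rows. This is only a sketch: `collapse_group` is a hypothetical helper name (not a function from any package), and the 0.3 threshold is taken from the question.

```r
# Example data from the question
df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))

# Hypothetical helper: collapse the rows of one ID into a single row
# when the time range of that ID is within `threshold`
collapse_group <- function(d, threshold = 0.3) {
  d$time.diff <- max(d$time) - min(d$time)
  if (d$time.diff[1] <= threshold) {
    # close together: one row with mean time and summed intensity
    data.frame(ID = d$ID[1],
               time = mean(d$time),
               intensity = sum(d$intensity),
               time.diff = d$time.diff[1])
  } else {
    # too far apart: keep every observation unchanged
    d
  }
}

# Split by ID, apply the helper to each piece, and stack the results
res <- do.call(rbind, lapply(split(df, df$ID), collapse_group))
rownames(res) <- NULL
res
```

Because each ID is handled on its own, the decision to merge or keep rows is made per group, which is effectively the conditional grouping asked for in the question.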
Using sqldf:
library(sqldf)
# <= 0.3 in the first branch, so an ID with a difference of exactly 0.3
# is not dropped by both branches
sqldf('SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif
       FROM df
       GROUP BY ID
       HAVING (MAX(time)-MIN(time)) <= 0.3
       UNION
       SELECT ID, df.time, df.intensity, df2.dif
       FROM (SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif
             FROM df
             GROUP BY ID
             HAVING (MAX(time)-MIN(time)) > 0.3) AS df2
       LEFT JOIN df USING (ID)')
Output:
ID time intensity dif
1 A 3.15 30 0.1
2 B 6.50 30 5.8
3 B 12.30 40 5.8
4 C 3.30 110 0.2