Conditional grouping and summing of data frames in [R]

I have a data frame like this:

df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"), 
                 time = c(3.1,3.2,6.5,12.3, 3.2, 3.4), 
                 intensity = c(10, 20, 30, 40, 50, 60))

      

| ID | time | intensity |
| :-- | ---: | ---: |
| A | 3.1 | 10 |
| A | 3.2 | 20 |
| B | 6.5 | 30 |
| B | 12.3 | 40 |
| C | 3.2 | 50 |
| C | 3.4 | 60 |

I would like to aggregate the rows by ID (summing the intensities), but only when the time difference within an ID is small, i.e. less than 0.3. I first calculated the time difference:

library(dplyr)

# time range (max - min) within each ID
df.2 <- df %>% 
        group_by(ID) %>% 
        mutate(time.diff = max(time) - min(time)) 

      

... which gives:

| ID | time | intensity | time.diff |
| :-- | ---: | ---: | ---: |
| A | 3.1 | 10 | 0.1 |
| A | 3.2 | 20 | 0.1 |
| B | 6.5 | 30 | 5.8 |
| B | 12.3 | 40 | 5.8 |
| C | 3.2 | 50 | 0.2 |
| C | 3.4 | 60 | 0.2 |

To be clear, I would like to get the result:

| ID | time | intensity | time.diff |
| :-- | ---: | ---: | ---: |
| A | 3.15 | 30 | 0.1 |
| B | 6.5 | 30 | 5.8 |
| B | 12.3 | 40 | 5.8 |
| C | 3.3 | 110 | 0.2 |

where the time is now the average of the merged observations, and the intensity is their sum. ID "B" keeps its two observations because their time difference is greater than 0.3. I tried with dplyr, but summarising always collapses the two "B" observations into one, and I want to keep them. I don't know how to do a conditional _group_by_.

Thanks for any ideas!

4 answers


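A possible option with data.table:

library(data.table)
# Note: setDT() and := modify df in place, so recreate df before trying other approaches.
# Step 1 adds the per-ID time range; step 2 overwrites time and intensity with the
# group mean and sum for IDs whose range is <= 0.3; unique() drops the duplicate rows.
unique(setDT(df)[, time.diff := max(time)-min(time), ID][
   time.diff <= 0.3, c('time', 'intensity') := list(mean(time),
        sum(intensity)), ID]) 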
#    ID  time intensity time.diff
#1:  A  3.15        30       0.1
#2:  B  6.50        30       5.8
#3:  B 12.30        40       5.8
#4:  C  3.30       110       0.2

      



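Or using dplyr:

library(dplyr)
df %>% 
   group_by(ID) %>%
   # indx flags IDs whose time range is at most 0.3
   mutate(time.diff=max(time)-min(time), indx=all(time.diff<=0.3),
         # flagged IDs get the group sum and mean; other rows keep their values
         intensity=ifelse(indx, sum(intensity), intensity),
         time=ifelse(indx, mean(time), time)) %>% 
   # keep every row of unflagged IDs, but only the first row of flagged ones
   filter(!indx|row_number()==1) %>%
   select(-indx)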
 #  ID  time intensity time.diff
 #1  A  3.15        30       0.1
 #2  B  6.50        30       5.8
 #3  B 12.30        40       5.8
 #4  C  3.30       110       0.2
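
The same idea can also be written as a literal "conditional group_by". This is only a sketch, assuming dplyr 1.0 or later (for the `.groups` argument). The helper column `grp` and the result name `res` are made up for illustration: `grp` is 0 for every row of a "close" ID and the row number otherwise, so grouping by `ID` and `grp` merges close IDs and leaves the others row by row.

```r
library(dplyr)

df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))

res <- df %>%
  group_by(ID) %>%
  mutate(time.diff = max(time) - min(time),
         # grp (hypothetical name): one shared key for close IDs,
         # a distinct key per row for all other IDs
         grp = if (time.diff[1] <= 0.3) 0L else row_number()) %>%
  group_by(ID, grp, time.diff) %>%
  summarise(time = mean(time), intensity = sum(intensity), .groups = "drop") %>%
  select(ID, time, intensity, time.diff)

print(res)
```

With the example data this should give the four rows from the question (A and C merged, both B rows kept).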

      


Another variation of the data.table solution:

setDT(df)[, time.diff := max(time) - min(time), by = ID
        ][, if (time.diff <= 0.3) 
                .(time = mean(time), intensity = sum(intensity))
            else .SD, by = .(ID, time.diff)]
#    ID time.diff  time intensity
# 1:  A       0.1  3.15        30
# 2:  B       5.8  6.50        30
# 3:  B       5.8 12.30        40
# 4:  C       0.2  3.30       110

      


# get time.diff (time range within each ID)
df$time.diff <- ave(x = df$time, df$ID, FUN = function(x) max(x) - min(x))

# helper split variable: the counter increases at every row whose ID has a
# range above 0.3, so each such row ends up in its own (cut, ID) group
df$cut <- cumsum(df$time.diff > .3)

# aggregate per (cut, ID) group, then drop the cut column with [2:5]
library(plyr)
ddply(df,c('cut','ID'),summarize,
      time = mean(time),
      intensity = sum(intensity),
      time.diff = mean(time.diff))[2:5]
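
To see why the `cut` variable does the job, here is a small base R sketch on the question's data, with no extra packages. It is only an illustration of the grouping logic, not a replacement for the ddply call; the names `key`, `groups` and `res` are made up for the example, and it assumes the rows of each ID are adjacent.

```r
df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))

df$time.diff <- ave(df$time, df$ID, FUN = function(x) max(x) - min(x))
df$cut <- cumsum(df$time.diff > 0.3)
print(df$cut)
# 0 0 1 2 2 2  -> the two B rows get different cut values (1 and 2)

# rows that share both cut and ID form one group; keep the original row order
key <- paste(df$cut, df$ID)
groups <- split(df, factor(key, levels = unique(key)))

res <- do.call(rbind, lapply(groups, function(g) {
  data.frame(ID = g$ID[1],
             time = mean(g$time),
             intensity = sum(g$intensity),
             time.diff = g$time.diff[1])
}))
rownames(res) <- NULL
print(res)
```

A and C each collapse to one row, while the two B rows stay separate because they never share a `cut` value.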

      


Using sqldf:

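library(sqldf)
# First SELECT: collapse IDs whose time range is at most 0.3.
# Second SELECT: keep the original rows of every other ID.
# (<= and > together cover a range of exactly 0.3, matching the other answers.)
sqldf('SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif FROM df 
         GROUP BY ID 
         HAVING (MAX(time)-MIN(time))<=0.3
         UNION
         SELECT ID, df.time, df.intensity, df2.dif
         FROM (SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif
         FROM df 
         GROUP BY ID 
         HAVING (MAX(time)-MIN(time))>0.3) as df2
         LEFT JOIN df USING (ID)')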

      

Output:

  ID  time intensity dif
1  A  3.15        30 0.1
2  B  6.50        30 5.8
3  B 12.30        40 5.8
4  C  3.30       110 0.2

      
