Conditional grouping and summing of data frames in [R]

Question

I have a data frame like this:

df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"), 
                 time = c(3.1,3.2,6.5,12.3, 3.2, 3.4), 
                 intensity = c(10, 20, 30, 40, 50, 60))

| ID | time | intensity |
| :- | ----: | ---------: |
| A | 3.1 | 10 |
| A | 3.2 | 20 |
| B | 6.5 | 30 |
| B | 12.3 | 40 |
| C | 3.2 | 50 |
| C | 3.4 | 60 |

I would like to aggregate values (sum the intensities) by ID, but only when the time difference within an ID is small, i.e. at most 0.3. I first calculated the time difference:

library(dplyr)
df.2 <- df %>% 
        group_by(ID) %>% 
        mutate(time.diff = max(time) - min(time))

... as a result:

| ID | time | intensity | time.diff |
| :- | ----: | ---------: | ---------: |
| A | 3.1 | 10 | 0.1 |
| A | 3.2 | 20 | 0.1 |
| B | 6.5 | 30 | 5.8 |
| B | 12.3 | 40 | 5.8 |
| C | 3.2 | 50 | 0.2 |
| C | 3.4 | 60 | 0.2 |

To be clear, I would like to get the result:

| ID | time | intensity | time.diff |
| :- | ----: | ---------: | ---------: |
| A | 3.15 | 30 | 0.1 |
| B | 6.5 | 30 | 5.8 |
| B | 12.3 | 40 | 5.8 |
| C | 3.3 | 110 | 0.2 |

where the time is now the average of the merged observations, and the intensity is their sum. ID "B" keeps both of its observations because its time difference is greater than 0.3. I tried with dplyr, but summarising always drops one of the "B" observations, which I want to keep, and I don't know how to do a conditional _group_by_.

Thanks for any ideas!

+3

r dplyr

mesontau May 27 '15 at 16:06

4 answers

Another variation of the data.table solution:

library(data.table)
setDT(df)[, time.diff := max(time) - min(time), by = ID
        ][, if (time.diff <= 0.3) 
                .(time = mean(time), intensity = sum(intensity))
            else .SD, by = .(ID, time.diff)]
#    ID time.diff  time intensity
# 1:  A       0.1  3.15        30
# 2:  B       5.8  6.50        30
# 3:  B       5.8 12.30        40
# 4:  C       0.2  3.30       110

+3

Arun May 27 '15 at 17:12

# get time.diff
df$time.diff <- ave(x = df$time,df$ID,FUN = function(x){max(x)-min(x)})

# new split variable to use with ID
df$cut <- cumsum(df$time.diff > .3)

# aggregate everything you need and ignore the cut variable
library(plyr)
ddply(df,c('cut','ID'),summarize,
      time = mean(time),
      intensity = sum(intensity),
      time.diff = mean(time.diff))[2:5]
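
To see why the extra split variable works, here is a minimal base R sketch of the intermediate values on the question's data. The name `key` is a hypothetical helper added only for illustration. It also suggests a caveat: the approach appears to rely on rows of the same ID being adjacent, since `cumsum` increments at every row whose time difference exceeds 0.3.

```r
# Rebuild the example data from the question
df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))

# Per-ID time range, repeated on every row of that ID
df$time.diff <- ave(df$time, df$ID, FUN = function(x) max(x) - min(x))

# The counter steps up on each "wide" row, so each such row gets its own value,
# while rows of a "tight" ID share whatever value the counter currently holds
df$cut <- cumsum(df$time.diff > 0.3)

# Hypothetical helper: the combined grouping key that (cut, ID) amounts to
key <- paste(df$cut, df$ID)
print(df$cut)
print(unique(key))
```

On this data the key takes four distinct values: one each for A and C, and one for each row of B.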

+1

ARobertson May 27 '15 at 18:47

Using sqldf:

library(sqldf)
sqldf('SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif FROM df 
         GROUP BY ID 
         HAVING (MAX(time)-MIN(time))<=0.3
         UNION
         SELECT ID, df.time, df.intensity, df2.dif
         FROM (SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif
         FROM df 
         GROUP BY ID 
         HAVING (MAX(time)-MIN(time))>0.3) as df2
         LEFT JOIN df USING (ID)')

Output:

  ID  time intensity dif
1  A  3.15        30 0.1
2  B  6.50        30 5.8
3  B 12.30        40 5.8
4  C  3.30       110 0.2

+1

mpalanco Jul 13 '15 at 9:43

akrun · Accepted Answer · 2015-05-27T16:17:29+0000

A possible option with data.table:

library(data.table)
unique(setDT(df)[, time.diff := max(time)-min(time), ID][
   time.diff <= 0.3, c('time', 'intensity') := list(mean(time),
        sum(intensity)), ID]) 
#    ID  time intensity time.diff
#1:  A  3.15        30       0.1
#2:  B  6.50        30       5.8
#3:  B 12.30        40       5.8
#4:  C  3.30       110       0.2

Or using dplyr

library(dplyr)
df %>% 
   group_by(ID) %>%
   mutate(time.diff=max(time)-min(time), indx=all(time.diff<=0.3),
         intensity=ifelse(indx, sum(intensity), intensity),
         time=ifelse(indx, mean(time), time)) %>% 
   filter(!indx|row_number()==1) %>%
   select(-indx)
 #  ID  time intensity time.diff
 #1  A  3.15        30       0.1
 #2  B  6.50        30       5.8
 #3  B 12.30        40       5.8
 #4  C  3.30       110       0.2
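
The question asked for a conditional group_by. One way to sketch that idea directly, rather than as the method used in the answers above, is to build a second grouping column that is constant for IDs within the threshold and unique per row otherwise, then summarise once. The names `grp`, `threshold` and `res` are hypothetical, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))

threshold <- 0.3  # cutoff taken from the question

res <- df %>%
  group_by(ID) %>%
  # grp (hypothetical helper): 0 for all rows of an ID whose time range is
  # within the threshold, a distinct row index for every row otherwise
  mutate(time.diff = max(time) - min(time),
         grp = if (time.diff[1] <= threshold) 0L else row_number()) %>%
  # regroup on the conditional key; rows sharing a key are merged
  group_by(ID, grp, time.diff) %>%
  summarise(time = mean(time), intensity = sum(intensity), .groups = "drop") %>%
  select(ID, time, intensity, time.diff)

print(res)
```

On the example data this should give the four rows requested in the question, with both "B" observations kept as singleton groups.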
