For each group, find observations with the maximum value of several columns

Suppose I have a data frame like this:

set.seed(4)
df <- data.frame(
    group = rep(1:10, each=3),
    id = rep(sample(1:3), 10),
    x = sample(c(rep(0, 15), runif(15))),
    y = sample(c(rep(0, 15), runif(15))),
    z = sample(c(rep(0, 15), runif(15)))
)

      

As seen above, some elements of the vectors x, y, z are set to zero, and the rest are drawn from a uniform distribution between 0 and 1.

For each group defined by the first column, I want to find three identifiers from the second column, indicating which row has the highest value of x, y, and z within the group. Let's assume there are no ties, except when the variable is 0 in every row of the given group, in which case I don't want to return any number as the ID of the row with the maximum value.

The result will look like this:

group  x  y  z
  1    2  2  1
  2    2  3  1
 ...  .........

      

My first thought is to select the rows with the maximum values separately for each variable, and then use merge to put them into one table. However, I am wondering whether it can be done without merge, for example with standard dplyr functions.
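For reference, a rough sketch of the merge-based idea in base R. It runs on a small made-up data frame (toy) rather than df, and max_id is a hypothetical helper name, not a function from any package.

```r
# Hypothetical toy data (not the df above): two groups, three ids each
toy <- data.frame(
  group = rep(1:2, each = 3),
  id    = c(3L, 1L, 2L, 3L, 1L, 2L),
  x     = c(0, 0.7, 0.2, 0, 0, 0),
  y     = c(0.5, 0, 0.9, 0.1, 0.4, 0),
  z     = c(0, 0, 0.3, 0.8, 0, 0.6)
)

# Hypothetical helper: one row per group holding the id at the maximum of
# `col`, or NA when `col` is zero throughout the group
max_id <- function(data, col) {
  ids <- sapply(split(data, data$group), function(d) {
    if (all(d[[col]] == 0)) NA_integer_ else d$id[which.max(d[[col]])]
  })
  out <- data.frame(group = as.integer(names(ids)), id = unname(ids))
  names(out)[2] <- col
  out
}

# One small table per variable, then merged back together on `group`
per_column <- lapply(c("x", "y", "z"), function(col) max_id(toy, col))
res <- Reduce(function(a, b) merge(a, b, by = "group"), per_column)
res
```

This gives the right shape, but it needs one intermediate table per variable, which is the part I would like to avoid.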



3 answers


Here is my suggested solution using plyr:



library(plyr)

# Per group: id at the maximum of each value column, NA if the column is all zero
ddply(df, .variables = "group", .fun = function(d) {
  apply(X = d[, c(-1, -2)], MARGIN = 2, FUN = function(v) {
    if (all(v == 0)) NA else d$id[which.max(v)]
  })
})

#   group  x  y  z
#1      1  2  2  1
#2      2  2  3  1
#3      3  1  3  2
#4      4  3  3  1
#5      5  2  3 NA
#6      6  3  1  3
#7      7  1  1  2
#8      8 NA  2  3
#9      9  2  1  3
#10    10  2 NA  2

      



This solution uses dplyr and tidyr. Note that if all values in a group are equal, we cannot decide which id should be selected, so filter(n_distinct(Value) > 1) is added to remove those entries. In the final output df2, NA indicates that all values in that group were the same. We can decide later whether to replace those NA values if we want. This solution should work for any number of ids or columns (x, y, z, ...).



library(dplyr)
library(tidyr)

df2 <- df %>%
  gather(Column, Value, -group, -id) %>%
  arrange(group, Column, desc(Value)) %>%
  group_by(group, Column) %>%
  # If all values within a group-Column pair are the same, remove that pair
  filter(n_distinct(Value) > 1) %>%
  slice(1) %>%
  select(-Value) %>%
  spread(Column, id)
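Since gather() and spread() have been superseded in tidyr by pivot_longer() and pivot_wider(), the same pipeline could be sketched with the newer verbs roughly as follows. This assumes tidyr 1.0.0 or later and dplyr 1.0.0 or later (for slice_max()), and it uses a small made-up data frame (toy) instead of df so that the result is easy to check by hand.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not the df from the question)
toy <- data.frame(
  group = rep(1:2, each = 3),
  id    = c(3L, 1L, 2L, 3L, 1L, 2L),
  x     = c(0, 0.7, 0.2, 0, 0, 0),
  y     = c(0.5, 0, 0.9, 0.1, 0.4, 0),
  z     = c(0, 0, 0.3, 0.8, 0, 0.6)
)

res <- toy %>%
  # Long format: one row per (group, id, Column) with its Value
  pivot_longer(c(x, y, z), names_to = "Column", values_to = "Value") %>%
  group_by(group, Column) %>%
  # Drop group-Column pairs where every value is identical (e.g. all zero)
  filter(n_distinct(Value) > 1) %>%
  # Keep the single row with the largest Value in each pair
  slice_max(Value, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-Value) %>%
  # Back to wide format; dropped pairs come back as NA
  pivot_wider(names_from = Column, values_from = id)

res
```

As with the original pipeline, a group in which every column is constant would drop out of the result entirely rather than appear as a row of NA.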

      



If you want to stick with dplyr only, you can use the multi-column variants of summarize / mutate. This should work regardless of the form of id; my initial attempt was slightly cleaner, but relied on an assumption about id values of zero.

df %>%
  group_by(group) %>%
  mutate_at(vars(-id), 
            # If the row is the max within the group, set the value
            # to the id and use NA otherwise
            funs(ifelse(max(.) != 0 & . == max(.),
                        id,
                        NA))) %>%
  select(-id) %>%
  summarize_all(funs(
    # There are zero or one non-NA values per group, so handle both cases
    if(any(!is.na(.)))
      na.omit(.) else NA))
## # A tibble: 10 x 4
##    group     x     y     z
##    <int> <int> <int> <int>
##  1     1     2     2     1
##  2     2     2     3     1
##  3     3     1     3     2
##  4     4     3     3     1
##  5     5     2     3    NA
##  6     6     3     1     3
##  7     7     1     1     2
##  8     8    NA     2     3
##  9     9     2     1     3
## 10    10     2    NA     2
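funs() has been deprecated since dplyr 0.8.0, and dplyr 1.0.0 introduced across() as the replacement for the _at / _all variants. A minimal sketch of the same idea with across() might look like the following. It assumes dplyr 1.0.0 or later and that the values are non-negative (as in the question), and it uses a small made-up data frame (toy) instead of df.

```r
library(dplyr)

# Hypothetical toy data (not the df from the question)
toy <- data.frame(
  group = rep(1:2, each = 3),
  id    = c(3L, 1L, 2L, 3L, 1L, 2L),
  x     = c(0, 0.7, 0.2, 0, 0, 0),
  y     = c(0.5, 0, 0.9, 0.1, 0.4, 0),
  z     = c(0, 0, 0.3, 0.8, 0, 0.6)
)

res <- toy %>%
  group_by(group) %>%
  summarise(across(
    c(x, y, z),
    # id of the row with the largest value, or NA when the column is all zero
    # (values are assumed non-negative, so max > 0 means "not all zero")
    ~ if (max(.x) > 0) id[which.max(.x)] else NA
  ))

res
```

This collapses the mutate / select / summarize steps into a single summarise() call, because the lambda can refer to id directly within each group.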

      
