For each group, find observations with the maximum value of several columns
Suppose I have a data frame like this:
set.seed(4)
df <- data.frame(
  group = rep(1:10, each = 3),
  id = rep(sample(1:3), 10),
  x = sample(c(rep(0, 15), runif(15))),
  y = sample(c(rep(0, 15), runif(15))),
  z = sample(c(rep(0, 15), runif(15)))
)
As seen above, some elements of the vectors x, y, z are set to zero, and the rest are drawn from a uniform distribution between 0 and 1.
For each group defined by the first column, I want to find three identifiers from the second column, each indicating the row with the highest value of x, y, and z, respectively, within the group. Let's assume there are no ties, except when a variable is 0 in every row of a given group, in which case I don't want to return any ID for the row with the maximum value.
The result will look like this:
group x y z
1 2 2 1
2 2 3 1
... .........
My first thought is to select the rows with the maximum values separately for each variable, and then use merge to combine them into one table. However, I am wondering whether it can be done without merge, for example with standard dplyr functions.
Here is my suggested solution using plyr:
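library(plyr)
ddply(df, .variables = "group", .fun = function(t) {
  # For each value column (everything except group and id), return the id of
  # the row holding the maximum, or NA if the column is all zero in this group
  apply(X = t[, c(-1, -2)], MARGIN = 2, FUN = function(z) {
    if (sum(abs(z)) == 0) NA else t$id[which.max(z)]
  })
})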
# group x y z
#1 1 2 2 1
#2 2 2 3 1
#3 3 1 3 2
#4 4 3 3 1
#5 5 2 3 NA
#6 6 3 1 3
#7 7 1 1 2
#8 8 NA 2 3
#9 9 2 1 3
#10 10 2 NA 2
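The rule applied to each column inside the ddply call can be looked at in isolation. The following is a minimal sketch, not part of the answer above: the helper name id_of_max and the toy vectors are hypothetical and only illustrate "id of the maximum, or NA when everything is zero".

```r
# Hypothetical helper: return the id belonging to the largest value,
# or NA when every value in the group is zero
id_of_max <- function(values, ids) {
  if (all(values == 0)) NA else ids[which.max(values)]
}

# Toy inputs (assumed for illustration, not taken from df)
some_max <- id_of_max(c(0.2, 0.9, 0), ids = 1:3)  # largest value is in position 2
all_zero <- id_of_max(c(0, 0, 0), ids = 1:3)      # no maximum to report
```

Since the question rules out ties among non-zero values, which.max, which returns the first maximum, is enough here.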
This solution uses dplyr and tidyr. Please note that if all values in a group are the same, we cannot decide which id should be selected. Therefore, filter(n_distinct(Value) > 1) is added to remove these cases. In the final output df2, NA indicates that all values in that group were the same. We can decide later how to treat those NA values if we want. This solution should work for any number of ids or columns (x, y, z, ...).
library(dplyr)
library(tidyr)
df2 <- df %>%
  gather(Column, Value, -group, -id) %>%
  arrange(group, Column, desc(Value)) %>%
  group_by(group, Column) %>%
  # If all values from a group-Column are all the same, remove that group-Column
  filter(n_distinct(Value) > 1) %>%
  slice(1) %>%
  select(-Value) %>%
  spread(Column, id)
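In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider. The same reshape-filter-reshape idea might be written as below. This is a sketch under assumptions: it needs tidyr 1.0.0 or later and dplyr 1.0.0 or later (for slice_max), and the toy data frame is hypothetical, not the question's df.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): in group 2, y is zero in every row
toy <- data.frame(
  group = rep(1:2, each = 3),
  id    = rep(1:3, 2),
  x     = c(0.2, 0.9, 0, 0, 0.4, 0.1),
  y     = c(0.5, 0, 0.3, 0, 0, 0)
)

toy_ids <- toy %>%
  # Long format: one row per (group, id, Column)
  pivot_longer(cols = -c(group, id), names_to = "Column", values_to = "Value") %>%
  group_by(group, Column) %>%
  # Drop group-Column pairs in which every value is identical
  filter(n_distinct(Value) > 1) %>%
  # Keep the single row with the largest Value in each remaining pair
  slice_max(Value, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-Value) %>%
  # Back to wide format; dropped pairs come back as NA
  pivot_wider(names_from = Column, values_from = id)
```

Here slice_max replaces the arrange plus slice(1) pair, and the group-Column pairs removed by the filter are filled with NA by pivot_wider.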
If you want to stick with dplyr only, you can use the multi-column variants of summarize / mutate. This should work regardless of the form of id; my initial attempt was slightly cleaner, but it relied on an assumption about zero values of id.
library(dplyr)
df %>%
  group_by(group) %>%
  mutate_at(vars(-id),
            # If the row is the max within the group, set the value
            # to the id and use NA otherwise
            funs(ifelse(max(.) != 0 & . == max(.),
                        id,
                        NA))) %>%
  select(-id) %>%
  summarize_all(funs(
    # There are zero or one non-NA values per group, so handle both cases
    if (any(!is.na(.)))
      na.omit(.) else NA))
## # A tibble: 10 x 4
## group x y z
## <int> <int> <int> <int>
## 1 1 2 2 1
## 2 2 2 3 1
## 3 3 1 3 2
## 4 4 3 3 1
## 5 5 2 3 NA
## 6 6 3 1 3
## 7 7 1 1 2
## 8 8 NA 2 3
## 9 9 2 1 3
## 10 10 2 NA 2
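funs() has been deprecated since dplyr 0.8.0, and the scoped verbs such as mutate_at and summarize_all are superseded by across(). A possible modern equivalent is sketched below. It assumes dplyr 1.0.0 or later and uses a hypothetical toy data frame rather than the question's df. It also collapses the two steps into one by picking the id with which.max directly.

```r
library(dplyr)

# Toy data (hypothetical): in group 2, y is zero in every row
toy <- data.frame(
  group = rep(1:2, each = 3),
  id    = rep(1:3, 2),
  x     = c(0.2, 0.9, 0, 0, 0.4, 0.1),
  y     = c(0.5, 0, 0.3, 0, 0, 0)
)

toy_ids <- toy %>%
  group_by(group) %>%
  summarize(across(
    -id,
    # id of the row with the group maximum, or NA if the maximum is zero
    ~ if (max(.x) != 0) id[which.max(.x)] else NA_integer_
  ))
```

Grouping columns are excluded from across() automatically, so -id selects only the value columns here.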