Remove cases from DF if they are duplicated in certain columns, while other columns should be different
I have a large data frame with multiple columns and many rows (200k). I am ordering rows by group variable and each group can have one or more entries. The rest of the columns for each group should have the same values, but in some cases they do not. It looks like this:
group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ
I want to delete all records of a group if the age or city is not identical for all rows of the group (indicating an observation error). Otherwise, I want to keep all records.
The result I hope will be:
group name age color city
2 Martin 78 black LA
2 Martin 78 blue LA
4 Jake 33 blue NJ
The closest I've come to is:
dup <- df[ duplicated(df[,c("group","name","color")]) | duplicated(df[,c("group","name","color")],fromLast=TRUE) ,"group"]
df_nodup <- df[!(df$group %in% dup),]
However, this is far from all I need.
Ps: I had the same question answered by py / pandas. However, I would like to get a solution for R.
/ e: While Frank's answer was helpful in understanding the solution principle, and his second suggestion worked, it was very slow. (took ~ 15 minutes on my df). user20650's answer was harder to understand, but significantly faster (~ 10 seconds).
source to share
A similar approach to Franks, you can count length
combinations unique
age
and city
on group
- do it using ave
. Then you can multiply your data if the length of unique combinations is more than one
# your data
df <- read.table(text="group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ ", header=T)
# calculate and subset
df[with(df, ave(paste(age, city), group, FUN=function(x) length(unique(x))))==1,]
# group name age color city
# 4 2 Martin 78 black LA
# 5 2 Martin 78 blue LA
# 8 4 Jake 33 blue NJ
source to share
Here's the approach:
temp <- tapply(df$group, list(df$name, df$age, df$city), unique)
temp[!is.na(temp)] <- 1
keepers <- names(which(apply(temp, 1, sum, na.rm=TRUE)==1))
df[df$name %in% keepers, ]
#4 2 Martin 78 black LA
#5 2 Martin 78 blue LA
#8 4 Jake 33 blue NJ
An alternative, slightly simpler approach:
temp2 <- unique(df[,c('name','age','city')])
keepers2 <- names(which(tapply(temp2$name, temp2$name, length)==1))
df[df$name %in% keepers2, ]
# group name age color city
#4 2 Martin 78 black LA
#5 2 Martin 78 blue LA
#8 4 Jake 33 blue NJ
source to share
Here's an approach using dplyr:
df <- read.table(text = "
group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ
", header = TRUE)
library(dplyr)
df %>%
group_by(group) %>%
filter(n_distinct(age) == 1 && n_distinct(city) == 1)
I think it's pretty easy to figure out what's going on - you group and then filter to keep the groups when there is only one distinct age and city.
source to share