Remove cases from DF if they are duplicated in certain columns, while other columns should be different

Question

Remove cases from DF if they are duplicated in certain columns, while other columns should be different

I have a large data frame with multiple columns and many rows (200k). I am ordering rows by group variable and each group can have one or more entries. The rest of the columns for each group should have the same values, but in some cases they do not. It looks like this:

group   name    age    color    city
1       Anton   50     orange   NY
1       Anton   21     red      NY
1       Anton   21     red      NJ
2       Martin  78     black    LA
2       Martin  78     blue     LA
3       Maria   29     red      NC
3       Maria   29     pink     LV
4       Jake    33     blue     NJ

I want to delete all records of a group if the age or city is not identical for all rows of the group (indicating an observation error). Otherwise, I want to keep all records.

The result I hope will be:

group   name    age    color    city
2       Martin  78     black    LA
2       Martin  78     blue     LA
4       Jake    33     blue     NJ

The closest I've come to is:

dup <- df[ duplicated(df[,c("group","name","color")]) | duplicated(df[,c("group","name","color")],fromLast=TRUE)    ,"group"]
df_nodup <- df[!(df$group %in% dup),]

However, this is far from all I need.

Ps: I had the same question answered by py / pandas. However, I would like to get a solution for R.

/ e: While Frank's answer was helpful in understanding the solution principle, and his second suggestion worked, it was very slow. (took ~ 15 minutes on my df). user20650's answer was harder to understand, but significantly faster (~ 10 seconds).

+3

r duplicates dataframe

cover51 23 Aug 14 at 21:54

source to share

3 answers

Here's the approach:

temp <- tapply(df$group, list(df$name, df$age, df$city), unique)  
temp[!is.na(temp)] <- 1
keepers <- names(which(apply(temp, 1, sum, na.rm=TRUE)==1))

df[df$name %in% keepers, ]
#4     2 Martin  78 black   LA
#5     2 Martin  78  blue   LA
#8     4   Jake  33  blue   NJ

An alternative, slightly simpler approach:

temp2 <- unique(df[,c('name','age','city')])
keepers2 <- names(which(tapply(temp2$name, temp2$name, length)==1))

df[df$name %in% keepers2, ]
#  group   name age color city
#4     2 Martin  78 black   LA
#5     2 Martin  78  blue   LA
#8     4   Jake  33  blue   NJ

+2

Jota 23 Aug 14 at 22:11

source to share

Here's an approach using dplyr:

df <- read.table(text = "
  group   name    age    color    city
  1       Anton   50     orange   NY
  1       Anton   21     red      NY
  1       Anton   21     red      NJ
  2       Martin  78     black    LA
  2       Martin  78     blue     LA
  3       Maria   29     red      NC
  3       Maria   29     pink     LV
  4       Jake    33     blue     NJ 
", header = TRUE)

library(dplyr)
df %>% 
  group_by(group) %>%
  filter(n_distinct(age) == 1 && n_distinct(city) == 1)

I think it's pretty easy to figure out what's going on - you group and then filter to keep the groups when there is only one distinct age and city.

+2

hadley 26 Aug 14 at 12:40

source to share

user20650 · Accepted Answer · 2014-08-23T22:46:25+0000

A similar approach to Franks, you can count length

combinations unique

age

and city

on group

- do it using ave

. Then you can multiply your data if the length of unique combinations is more than one

# your data

df <- read.table(text="group   name    age    color    city
1       Anton   50     orange   NY
1       Anton   21     red      NY
1       Anton   21     red      NJ
2       Martin  78     black    LA
2       Martin  78     blue     LA
3       Maria   29     red      NC
3       Maria   29     pink     LV
4       Jake    33     blue     NJ ", header=T)

# calculate and subset

df[with(df, ave(paste(age, city), group, FUN=function(x) length(unique(x))))==1,]

#   group   name age color city
# 4     2 Martin  78 black   LA
# 5     2 Martin  78  blue   LA
# 8     4   Jake  33  blue   NJ

Remove cases from DF if they are duplicated in certain columns, while other columns should be different

More articles: