Removing rows in data.frame with columns included in others
I am trying to achieve something similar unique
in data.frame where column every column item in a row is a vector. What I want to do is if the elements of the vector in the column of that hat row are a subset of or equal to another, remove the row with fewer elements. I can achieve this with a nested loop for
, but since the data contains 400,000 rows, the program is very inefficient.
Sample data
# Set the seed for reproducibility
set.seed(42)
# Create a random data frame
mydf <- data.frame(items = rep(letters[1:4], length.out = 20),
grps = sample(1:5, 20, replace = TRUE),
supergrp = sample(LETTERS[1:4], replace = TRUE))
# Aggregate items into a single column
temp <- aggregate(items ~ grps + supergrp, mydf, unique)
# Arrange by number of items for each grp and supergroup
indx <- order(lengths(temp$items), decreasing = T)
temp <- temp[indx, ,drop=FALSE]
Temp looks like
grps supergrp items
1 4 D a, c, d
2 3 D c, d
3 5 D a, d
4 1 A b
5 2 A b
6 3 A b
7 4 A b
8 5 A b
9 1 D d
10 2 D c
Now you can see that the second combination of supergraph and elements in the second and third lines is contained in the first line. So, I want to remove the second and third rows from the result. Likewise, lines 5-8 are on line 4. Finally, lines 9 and 10 are on the first line, so I want to delete lines 9 and 10. So my result will look like this:
grps supergrp items
1 4 D a, c, d
4 1 A b
My implementation looks like this:
# initialise the result dataframe by first row of old data frame
newdf <-temp[1, ]
# For all rows in the the original data
for(i in 1:nrow(temp))
{
# Index to check if all the items are found
indx <- TRUE
# Check if item in the original data appears in the new data
for(j in 1:nrow(newdf))
{
if(all(c(temp$supergrp[[i]], temp$items[[i]]) %in%
c(newdf$supergrp[[j]], newdf$items[[j]]))){
# set indx to false if a row with same items and supergroup
# as the old data is found in the new data
indx <- FALSE
}
}
# If none of the rows in new data contain items and supergroup in old data append that
if(indx){
newdf <- rbind(newdf, temp[i, ])
}
}
I believe there is an efficient way to do this in R; can use frames tidy
and dplyr
but I am missing a trick. Apologies for the long-standing question. Any input would be much appreciated.
source to share
I would try to get the items from the list column and store them in a longer dataframe. Here's my somewhat hacky solution:
library(stringr)
items <- temp$items %>%
map(~str_split(., ",")) %>%
map_df(~data.frame(.))
out <- bind_cols(temp[, c("grps", "supergrp")], items)
out %>%
gather(item_name, item, -grps, -supergrp) %>%
select(-item_name, -grps) %>%
unique() %>%
filter(!is.na(item))
source to share