Removing rows in data.frame with columns included in others

I am trying to achieve something similar unique

in data.frame where column every column item in a row is a vector. What I want to do is if the elements of the vector in the column of that hat row are a subset of or equal to another, remove the row with fewer elements. I can achieve this with a nested loop for

, but since the data contains 400,000 rows, the program is very inefficient.

Sample data

# Set the seed for reproducibility 
set.seed(42)

# Create a random data frame
mydf <- data.frame(items = rep(letters[1:4], length.out = 20), 
                   grps = sample(1:5, 20, replace = TRUE),
                   supergrp =  sample(LETTERS[1:4], replace = TRUE))


# Aggregate items into a single column
temp <- aggregate(items ~ grps + supergrp, mydf, unique)

# Arrange by number of items for each grp and supergroup 
indx <- order(lengths(temp$items), decreasing = T)
temp <- temp[indx, ,drop=FALSE]

      

Temp looks like

       grps supergrp   items
    1     4        D a, c, d
    2     3        D    c, d
    3     5        D    a, d
    4     1        A       b
    5     2        A       b
    6     3        A       b
    7     4        A       b
    8     5        A       b
    9     1        D       d
   10     2        D       c

      

Now you can see that the second combination of supergraph and elements in the second and third lines is contained in the first line. So, I want to remove the second and third rows from the result. Likewise, lines 5-8 are on line 4. Finally, lines 9 and 10 are on the first line, so I want to delete lines 9 and 10. So my result will look like this:

      grps supergrp   items
    1    4        D a, c, d
    4    1        A       b 

      

My implementation looks like this:

# initialise the result dataframe by first row of old data frame
newdf <-temp[1, ]

# For all rows in the the original data
for(i in 1:nrow(temp))
{
  # Index to check if all the items are found 
  indx <- TRUE

  # Check if item in the original data appears in the new data
  for(j in 1:nrow(newdf))
  {
   if(all(c(temp$supergrp[[i]], temp$items[[i]]) %in% 
          c(newdf$supergrp[[j]], newdf$items[[j]]))){
     # set indx to false if a row with same items and supergroup  
     # as the old data is found in the new data
    indx <- FALSE
   }
  }

  # If none of the rows in new data contain items and supergroup in old data append that
  if(indx){
    newdf <- rbind(newdf, temp[i, ])
  }
}

      

I believe there is an efficient way to do this in R; can use frames tidy

and dplyr

but I am missing a trick. Apologies for the long-standing question. Any input would be much appreciated.

+3


source to share


1 answer


I would try to get the items from the list column and store them in a longer dataframe. Here's my somewhat hacky solution:



library(stringr)

items <- temp$items %>% 
    map(~str_split(., ",")) %>% 
    map_df(~data.frame(.))

out <- bind_cols(temp[, c("grps", "supergrp")], items)

out %>% 
    gather(item_name, item, -grps, -supergrp) %>% 
    select(-item_name, -grps) %>% 
    unique() %>% 
    filter(!is.na(item))

      

+1


source







All Articles