Count with commas separates unique values ​​in a string

The first two columns of the dataframe form a composite key and there is a char column that contains comma-separated integers. My goal is to create a column that contains the number of unique integers in a row. I know an approach of converting a row to columns using str_split_fixed and then counting the unique values, but because of the length of the row, a lot of columns have been added and everything is lagging. Is there any other method? The actual dataset contains 500K rows and 53 columns. Example dataset:
DF

c1      c2    c3  
aa      11   1,13,4,5,4,7,9    
bb      22   2,5,2,4,5,7,11,     
cc      33   11,14,3,1,    
dd      44   1,1,2,4,5,6,15,    
ee      55   4,3,3,1,14,17,

      

desired result:

c1        c2             c3             c4  
------ | ------   | ------          | -----   
aa     | 11       | 1,13,4,5,4,7,9  |  6    
------ | ------   | ------          | -----   
bb     | 22       | 2,5,2,4,5,7,11, |  5   
------ | ------   | ------          | -----   
cc     | 33       | 11,14,3,1,      |  4   
------ | ------   | ------          | -----   
dd     | 44       | 1,1,2,4,5,6,15, |  6       
------ | ------   | ------          | -----   
ee     | 55       | 4,3,3,1,7,17,7, |  5    
------ | ------   | ------          | -----  

      

Any help would be appreciated!

+3


source to share


3 answers


We can use stri_extract

to fetch all numbers, then loop through list

, find length

elementsunique



library(stringi)
df1$Count <- sapply(stri_extract_all_regex(df1$col3, "[0-9]+"), 
                     function(x) length(unique(x)))

      

+4


source


Using strsplit

with uniqueN

from data.table

-package:

df$c4 <- sapply(strsplit(df$c3,','), uniqueN)

      

which gives:

> df
  c1 c2              c3 c4
1 aa 11  1,13,4,5,4,7,9  6
2 bb 22 2,5,2,4,5,7,11,  5
3 cc 33      11,14,3,1,  4
4 dd 44 1,1,2,4,5,6,15,  6
5 ee 55  4,3,3,1,14,17,  5

      

NOTE: if df$c3

is a factor variable, wrap it in as.character

:sapply(strsplit(as.character(df$c3), ','), uniqueN)




Another R base alternative to create df$c4

:

sapply(regmatches(df$c3, gregexpr('\\d+', df$c3)), function(x) length(unique(x)))

      




A tidyverse

alternative:

library(dplyr)
library(tidyr)
df %>% 
  separate_rows(c3) %>% 
  filter(c3 != '') %>% 
  group_by(c1) %>% 
  summarise(c4 = n_distinct(c3)) %>% 
  left_join(df, .)

      

+4


source


Assuming your df is something like below:

df <- c("1,13,4,5,4,7,9,", "2,5,2,4,5,7,11,","11,14,3,1,4,"," 1,1,2,4,5,6,15,","4,6,3,3,1,14,17,14,")
df <- gsub("\\s+|,$","",df) ##Removal of unnecssary spaces and trailing commas

      

then you can use baseR :

unlist(lapply(strsplit(df,split=","),function(x)length(unique(x))))

      

The result will be something like this:

[1] 6 5 5 6 6

      

+3


source







All Articles