Finding unique tuples in R but ignoring order

Since my real data is much more complex, I made a smaller sample dataset (I left the reshaping code in to show how I generated the data).

set.seed(7)
x = rep(seq(2010,2014,1), each=4)
y = rep(seq(1,4,1), 5)
z = matrix(replicate(5, sample(c("A", "B", "C", "D"))))
temp_df = cbind.data.frame(x,y,z)
colnames(temp_df) = c("Year", "Rank", "ID")
head(temp_df)
require(reshape2)
dcast(temp_df, Year ~ Rank)

      

that leads to...

> dcast(temp_df, Year ~ Rank)
Using ID as value column: use value.var to override.
  Year 1 2 3 4
1 2010 D B A C
2 2011 A C D B
3 2012 A B D C
4 2013 D A C B
5 2014 C A B D

      

Now I essentially want something like unique() that ignores ordering, so I can find which rows share the same first 3 elements regardless of their order.

So in this case:

I would have A, B, C on line 5

I would have A, B, D on lines 1 and 3

I would have A, C, D on lines 2 and 4

Also I need to count these "unique" events

Two more things. First, my values are strings and I need to keep them as strings. Second, if possible, I would like a column between Year and 1 called Weighting, and then include each row's weighting when counting these unique combinations. This is not that important: all the weightings will be small positive integers, so I could duplicate rows beforehand to account for the weightings and then tabulate the unique combinations.



1 answer


You can do something like this:

df <- dcast(temp_df, Year ~ Rank)

combos <- apply(df[, 2:4], 1, function(x) paste0(sort(x), collapse = ""))

combos
#     1     2     3     4     5 
# "ABD" "ACD" "ABD" "ACD" "ABC" 

For each row of the data frame, the values in the columns labeled 1, 2, and 3 (that is, df[, 2:4]) are sorted with sort and then concatenated with paste0. Since the order doesn't matter, this ensures that rows containing the same elements always get the same key.

Note that paste0(...) is equivalent to paste(..., sep = ""). The collapse argument tells it to join the elements of the vector into a single string, separated by the value passed to collapse. Here we set collapse = "", which means no separator between values, resulting in "ABC", "ACD", etc.

Then you can get the count of each combination with table:

table(combos)
# ABC ABD ACD 
#   1   2   2 
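
The question also asks which rows fall into each combination and how to take a Weighting column into account. Here is a minimal sketch of one way to do that, not a definitive solution. It rebuilds the wide table from the question by hand so it runs on its own, and the Weighting values are invented for illustration only. Selecting the rank columns by name rather than by position means the extra Weighting column between Year and 1 does not shift the selection.

```r
# Wide table copied from the question; the Weighting values are hypothetical.
df <- data.frame(
  Year      = 2010:2014,
  Weighting = c(1, 2, 1, 3, 1),   # assumption: made-up small positive integers
  `1` = c("D", "A", "A", "D", "C"),
  `2` = c("B", "C", "B", "A", "A"),
  `3` = c("A", "D", "D", "C", "B"),
  `4` = c("C", "B", "C", "B", "D"),
  check.names = FALSE,            # keep "1", "2", ... as column names
  stringsAsFactors = FALSE        # keep the IDs as strings
)

# Order-insensitive key from the first three rank columns, selected by name.
rank_cols <- c("1", "2", "3")
df$combo <- unname(apply(df[, rank_cols], 1,
                         function(x) paste(sort(x), collapse = "")))

# Which years (rows) belong to each combination.
rows_by_combo <- split(df$Year, df$combo)
# $ABC: 2014    $ABD: 2010 2012    $ACD: 2011 2013

# Weighted count: sum the weights within each combination
# instead of counting rows.
weighted <- tapply(df$Weighting, df$combo, sum)
# ABC ABD ACD
#   1   2   5
```

With all weights equal to 1, tapply(..., sum) gives the same numbers as table(combos), so this avoids duplicating rows before tabulating.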

      
