Finding unique tuples in R but ignoring order
Since my real data is much more complex, I made a smaller sample dataset (I left the code in to show how I generated the data).
set.seed(7)
x = rep(seq(2010,2014,1), each=4)
y = rep(seq(1,4,1), 5)
z = matrix(replicate(5, sample(c("A", "B", "C", "D"))))
temp_df = cbind.data.frame(x,y,z)
colnames(temp_df) = c("Year", "Rank", "ID")
head(temp_df)
library(reshape2)
dcast(temp_df, Year ~ Rank)
This gives:
> dcast(temp_df, Year ~ Rank)
Using ID as value column: use value.var to override.
  Year 1 2 3 4
1 2010 D B A C
2 2011 A C D B
3 2012 A B D C
4 2013 D A C B
5 2014 C A B D
Now I essentially want something like unique, but ignoring order, to find the distinct combinations formed by the first 3 elements of each row.
So in this case:
I would have A, B, C on line 5
I would have A, B, D on lines 1 and 3
I would have A, C, D on lines 2 and 4
I also need to count how often each of these "unique" combinations occurs.
Two more things. First, my values are strings and I need to leave them as strings. Second, if possible, I would have a column between Year and 1 called Weighting, and then include each row's weighting when counting these unique combinations. This is not that important because all the weights will be small positive integers, so I can potentially duplicate rows earlier to account for the weightings and then tabulate the unique combinations.
You can do something like this:
df <- dcast(temp_df, Year ~ Rank, value.var = "ID")  # name the value column explicitly
combos <- apply(df[, 2:4], 1, function(x) paste0(sort(x), collapse = ""))
combos
# 1 2 3 4 5
# "BCD" "ABC" "ACD" "BCD" "ABC"
For each row of the data frame, the values in the columns labeled 1, 2, and 3 (that is, df[, 2:4]) are sorted with sort and then concatenated with paste0. Because the values are sorted first, rows that contain the same values in a different order produce the same string.
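To check the idea against the rows listed in the question, here is a minimal base R sketch. It assumes the wide table is typed in by hand from the dcast output printed in the question (so it does not depend on the random draw or on reshape2); the names wide, top3, and by_combo are only illustrative.

```r
# Wide table copied by hand from the dcast() output shown in the question
wide <- data.frame(
  Year = 2010:2014,
  `1` = c("D", "A", "A", "D", "C"),
  `2` = c("B", "C", "B", "A", "A"),
  `3` = c("A", "D", "D", "C", "B"),
  `4` = c("C", "B", "C", "B", "D"),
  check.names = FALSE,       # keep "1", "2", ... as column names
  stringsAsFactors = FALSE   # keep the IDs as plain strings
)

top3 <- c("1", "2", "3")     # only the first three ranks matter

# Sort within each row, then glue the sorted values into one key
combos <- apply(wide[, top3], 1, function(x) paste0(sort(x), collapse = ""))
unname(combos)
# "ABD" "ACD" "ABD" "ACD" "ABC"

# Which years (rows) share each combination
by_combo <- split(wide$Year, combos)
by_combo$ABC   # 2014
by_combo$ABD   # 2010 2012
by_combo$ACD   # 2011 2013
```

The values stay character throughout, and split() gives the row membership the question asks for alongside the counts.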
Note that paste0(...) is equivalent to paste(..., sep = ""). The collapse argument tells it to concatenate the elements of a vector into a single string, separated by the value passed to collapse. Here we set collapse = "", which means no separator between values, resulting in "ABC", "ACD", etc.
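A small sketch of how sort, paste0, and collapse interact, using a made-up vector x. The last part is an assumption about the real data: if the IDs can be longer than one character, an empty collapse can make different sets look identical, so a separator that never occurs in the IDs (here "|") is probably safer.

```r
x <- c("D", "B", "A")

unsorted_key <- paste0(x, collapse = "")         # "DBA"
sorted_key   <- paste0(sort(x), collapse = "")   # "ABD"
dashed_key   <- paste(sort(x), collapse = "-")   # "A-B-D"

# paste0() is paste() with sep = ""
same <- identical(paste0("A", "B"), paste("A", "B", sep = ""))   # TRUE

# With multi-character IDs an empty collapse is ambiguous:
ambiguous_1 <- paste0(sort(c("AB", "C")), collapse = "")   # "ABC"
ambiguous_2 <- paste0(sort(c("A", "BC")), collapse = "")   # "ABC" as well
safe_1 <- paste(sort(c("AB", "C")), collapse = "|")        # "AB|C"
safe_2 <- paste(sort(c("A", "BC")), collapse = "|")        # "A|BC"
```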
Then you can get the count of each combination with table:
table(combos)
# ABC ACD BCD
# 2 1 2
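For the optional weighting, one possibility is to sum the weights per combination instead of counting rows. This is only a sketch: the weights vector below is invented for illustration (the question does not give any values), and combos is written out by hand for the five rows of the table printed in the question.

```r
# Combination keys for the five rows of the table in the question
combos  <- c("ABD", "ACD", "ABD", "ACD", "ABC")
# Hypothetical weights, one per row (not from the question)
weights <- c(2L, 1L, 3L, 1L, 4L)

# Sum the weights within each combination
weighted <- tapply(weights, combos, sum)
weighted
# ABC ABD ACD
#   4   5   2

# Same totals via the "duplicate the rows" idea from the question
duplicated_counts <- table(rep(combos, times = weights))
```

With all weights equal to 1 this reduces to the plain table(combos) count.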