Creating a unique identifier variable as a combination of variables
I have a data frame (df) or data table (dt) with, say, 1000 variables and 1000 observations. I have verified that there are no duplicates in the observations, so dt[!duplicated(dt)] has the same number of rows as the original file.
I would like to create an ID variable for each observation from a combination of some of the 1000 variables I have. Unlike in other SO questions, I don't know which variables are most suitable for building an identifier, and I will probably need a combination of at least 3 or 4 variables.
Is there any package or function in R that finds the most efficient combination of variables to create an ID variable? In my real-life case I am struggling to build an ID manually, and the result is probably not the best combination of variables.
Example with mtcars:
library(data.table)
example <- data.table(mtcars) # data.table() drops the mtcars row names by default
example <- example[!duplicated(example), ]
example[, id_var_wrong := paste0(mpg, "_", cyl)]
length(unique(example$id_var_wrong)) # Wrong ID: only 27 distinct values despite 32 observations
example[, id_var_good := paste0(wt, "_", qsec)]
length(unique(example$id_var_good)) # Good ID: as many distinct values as observations (32)
Is there any function to find wt and qsec automatically rather than manually?
Homemade algorithm: the principle is to greedily take the variable with the largest number of distinct values, then keep only the rows that are still duplicated on the variables chosen so far, and iterate. It doesn't guarantee the best (smallest) solution, but it's an easy way to get a pretty good solution quickly.
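set.seed(1)
# 100 observations x 1000 variables of random letters
mat <- replicate(1000, sample(c(letters, LETTERS), size = 100, replace = TRUE))
library(dplyr)
columnsID <- function(mat) {
  df <- df0 <- as.data.frame(mat) # columns are named V1, V2, ...
  vars <- c()
  while (nrow(df) > 0) {
    # Pick the column with the most distinct values among the remaining rows
    var_best <- names(which.max(sapply(df, n_distinct)))
    vars <- append(vars, var_best)
    # Keep only the rows that are still duplicated on the columns chosen so far
    df <- df0 %>%
      group_by(across(all_of(vars))) %>%
      filter(n() > 1) %>%
      ungroup()
  }
  vars
}
columnsID(mat)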
[1] "V68" "V32"
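Since the greedy search does not guarantee the smallest set, here is a minimal sketch of an exhaustive alternative for small key sizes. It uses base R only. The helper name smallest_key and its max_size argument are made up for this illustration and do not come from any package. The number of combinations grows very quickly, so with 1000 columns this is probably only practical for max_size of 1 or 2.

```r
# Hypothetical helper (not from any package): try every combination of
# 1, 2, ..., max_size columns and return the first one with no duplicated rows.
smallest_key <- function(df, max_size = 3) {
  df <- as.data.frame(df)  # so that df[, vars] also works for a data.table input
  for (k in seq_len(min(max_size, ncol(df)))) {
    for (vars in combn(names(df), k, simplify = FALSE)) {
      # anyDuplicated() returns 0 when every row of the subset is unique
      if (anyDuplicated(df[, vars, drop = FALSE]) == 0) {
        return(vars)
      }
    }
  }
  NULL  # no combination of at most max_size columns identifies the rows
}

key <- smallest_key(mtcars)
key
anyDuplicated(mtcars[, key, drop = FALSE]) == 0
```

On mtcars no single column is unique, so the sketch returns a pair of columns. It stops at the first pair it finds, which need not be wt and qsec.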
In many cases, there is a natural key that uniquely identifies each observation. For example, the mtcars dataset has unique row names.
library(data.table)
data.table(mtcars, keep.rownames = "id")
id mpg cyl disp hp drat wt qsec vs am gear carb
1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3: Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5: Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6: Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
...
If there is no natural key, I suggest creating an artificial key by simply numbering the rows sequentially and storing the number in an additional column:
data.table(mtcars)[, rn := .I][]
mpg cyl disp hp drat wt qsec vs am gear carb rn
1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
...
Anything else may not be worth the effort, and it carries the risk that the attribute values become identical, for example when they are rounded.
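To illustrate the rounding risk, here is a small sketch in base R using the wt and qsec columns from the question. The variable names exact_id and rounded_id are only for this illustration.

```r
# An ID pasted together from wt and qsec is unique on the raw values ...
exact_id <- paste(mtcars$wt, mtcars$qsec, sep = "_")
anyDuplicated(exact_id) == 0   # TRUE: no duplicates

# ... but not after rounding both columns to whole numbers:
# "Mazda RX4 Wag" (2.875, 17.02) and "Hornet Sportabout" (3.440, 17.02)
# both become "3_17".
rounded_id <- paste(round(mtcars$wt), round(mtcars$qsec), sep = "_")
anyDuplicated(rounded_id) > 0  # TRUE: duplicates appear

# A sequential row number is not affected by rounding the attributes.
row_id <- seq_len(nrow(mtcars))
```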