Creating a unique identifier variable as a combination of variables

I have a data frame (df) or data table (dt) with, say, 1000 variables and 1000 observations. I have verified that there are no duplicates in the observations, so dt[!duplicated(dt)] has the same number of rows as the original file.

I would like to create an ID variable for each observation from a combination of some of the 1000 variables I have. Unlike in other SO questions, I don't know which variables are most suitable for creating an identifier, and it is likely that I need a combination of at least 3 or 4 variables.

Is there any package / function in R that could get the most efficient combination of variables to create an ID variable? In my real life example, I am struggling to create an ID manually and it is probably not the best combination of variables.

Example with mtcars:

require(data.table)
example <- data.table(mtcars)
rownames(example) <- NULL # Delete mtcars row names
example <- example[!duplicated(example),]
example[,id_var_wrong := paste0(mpg,"_",cyl)]
length(unique(example$id_var_wrong)) # Wrong ID, there are only 27 different values for this variable despite 32 observations

example[,id_var_good := paste0(wt,"_",qsec)]
length(unique(example$id_var_good)) # Good ID as there are equal number of unique values as different observations.

      

Is there any function to find wt and qsec automatically rather than manually?


2 answers


Homemade algorithm: the principle is to greedily take the variable with the largest number of distinct values, then keep only the rows that are still duplicated on the variables chosen so far, and iterate. It doesn't guarantee the best (smallest) solution, but it's an easy way to get a pretty good solution quickly.



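set.seed(1)
mat <- replicate(1000, sample(c(letters, LETTERS), size = 100, replace = TRUE))

library(dplyr)

columnsID <- function(mat) {
  # as.data.frame() names the columns of an unnamed matrix V1, V2, ...
  df <- df0 <- as.data.frame(mat, stringsAsFactors = FALSE)
  vars <- character(0)
  while (nrow(df) > 0) {
    # Only consider columns that have not been picked yet
    candidates <- setdiff(names(df0), vars)
    if (length(candidates) == 0) stop("rows are not unique; no identifying combination exists")
    # Greedy step: column with the most distinct values among the unresolved rows
    var_best <- candidates[which.max(sapply(df[candidates], n_distinct))]
    vars <- c(vars, var_best)
    # Keep only the rows that are still duplicated on the chosen columns
    df <- df0 %>%
      group_by(across(all_of(vars))) %>%
      filter(n() > 1) %>%
      ungroup()
  }
  vars
}

columnsID(mat)
# Output reported in the original answer; the exact columns depend on the R version and seed
[1] "V68" "V32"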
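Since the greedy approach above does not guarantee the smallest combination, an exhaustive search is an option when the number of columns is small. The following is only a sketch under assumptions: find_min_key and its max_size argument are hypothetical names, not part of any package, and the number of combinations grows very quickly, so this is not practical for 1000 variables without first narrowing down the candidate columns.

```r
library(data.table)

# Hypothetical helper (not from any package): returns the first smallest set of
# columns whose combined values are unique across all rows, or NULL if no set
# of up to max_size columns is unique.
find_min_key <- function(dt, max_size = 3) {
  n <- nrow(dt)
  for (k in seq_len(max_size)) {
    # All column combinations of size k, tried in order
    for (cols in combn(names(dt), k, simplify = FALSE)) {
      # uniqueN() counts the distinct rows over the given columns
      if (uniqueN(dt, by = cols) == n) return(cols)
    }
  }
  NULL
}

example <- data.table(mtcars)
key_cols <- find_min_key(example)
key_cols
```

On mtcars no single column is unique, so the result is a pair of columns; it need not be wt and qsec, since the first unique pair in column order is returned.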

      



In many cases, there is a natural key that uniquely identifies each observation. For example, the mtcars dataset has unique row names.

library(data.table)
data.table(mtcars, keep.rownames = "id")

      

                     id  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
 2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
 3:          Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
 4:      Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
 5:   Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
 6:             Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
 ...

      

If there is no natural key, I suggest creating a surrogate key by simply numbering the rows sequentially and storing the number in an additional column:



data.table(mtcars)[, rn := .I][]

      

     mpg cyl  disp  hp drat    wt  qsec vs am gear carb rn
 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4  1
 2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  2
 3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1  3
 4: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1  4
 5: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2  5
 6: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1  6
 ...

      

Anything else may not be worth the effort, and it carries the risk that the attribute values become identical, for example when they are rounded.
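To illustrate the rounding risk, here is a small sketch using the wt and qsec combination from the question (rounding to whole numbers is just an assumed example of a lossy transformation):

```r
library(data.table)

dt <- data.table(mtcars)

# At full precision, wt and qsec together identify every row
unique_before <- uniqueN(dt, by = c("wt", "qsec")) == nrow(dt)

# After rounding both columns, several rows share the same (wt, qsec) pair
dt[, c("wt", "qsec") := .(round(wt), round(qsec))]
unique_after <- uniqueN(dt, by = c("wt", "qsec")) == nrow(dt)

unique_before
unique_after
```

A sequential row number such as rn is not affected by this kind of change to the other columns.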
