Is there an easy way to connect unique data points in a data frame?

I want to extract pairs of values from a dataframe, where each value is paired only with values that are not in its own column. Each number in column 1 is paired with all numbers in the columns to its right. Likewise, numbers in column 2 are paired only with numbers in column 3 or higher.

I've created a script that does this using a bird's nest of 'for' loops, but I believe there should be a more elegant way to do this.

Sample data:

structure(list(A = 1:3, B = 4:6, C = 7:9), .Names = c("A", "B", 
          "C"), class = "data.frame", row.names = c(NA, -3L))

      

Desired output:

structure(list(X1 = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 
          3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6), X2 = c(4, 5, 6, 7, 
          8, 9, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 7, 8, 9, 7, 8, 9, 7, 
          8, 9)), .Names = c("X1", "X2"), row.names = c(NA, 27L), class = "data.frame")

      


3 answers


Here's an approach using the data.table package and its very efficient functions CJ and rbindlist (assuming your dataset is named df):

library(data.table)
res <- rbindlist(lapply(seq_len(length(df) - 1), 
        function(i) CJ(df[, i], unlist(df[, -(seq_len(i))]))))

      

Then you can set the column names by reference (if you insist on "X1" and "X2") using setnames:

setnames(res, 1:2, c("X1", "X2"))

      

You can also convert back to a data.frame by reference (if you want exactly the output you specified) with setDF():

setDF(res)
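
To see what CJ contributes, here is a minimal sketch of the first two steps written out by hand. It assumes the sample data from the question is stored in df; the names step1, step2 and both are made up for illustration and are not part of the answer above.

```r
library(data.table)

# Sample data from the question
df <- data.frame(A = 1:3, B = 4:6, C = 7:9)

# CJ() is a cross join: it returns every combination of its arguments.
# Step 1: pair each value in A with all values in B and C.
step1 <- CJ(X1 = df$A, X2 = c(df$B, df$C))

# Step 2: pair each value in B with all values in C.
step2 <- CJ(X1 = df$B, X2 = df$C)

# rbindlist() stacks the per-column results into one table.
both <- rbindlist(list(step1, step2))

nrow(step1)  # 18: 3 values in A times 6 values in B and C
nrow(both)   # 27: 18 pairs from step 1 plus 9 from step 2
```

The lapply call in the answer does the same thing for any number of columns, so nothing has to be written out per column.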

      

Here df is the input dataset.


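out1 <- do.call(rbind, lapply(1:(ncol(df) - 1), function(i) {
  # Keep column i and everything to its right
  x1 <- df[, i:ncol(df)]
  # Unique values found in the columns to the right of column i
  Un1 <- unique(unlist(x1[, -1]))
  data.frame(X1 = rep(x1[, 1], each = length(Un1)), X2 = Un1)
}))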

 all.equal(out, out1) #if `out` is the expected output
 #[1] TRUE
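
The check above needs both df and out to exist. A minimal setup, assuming they hold the sample data and the desired output from the question, might look like this. The rep() calls are only a compact way to write the same 27 rows, not something the question itself used.

```r
# Sample data from the question
df <- data.frame(A = 1:3, B = 4:6, C = 7:9)

# Desired output from the question, written compactly with rep().
# as.numeric() keeps the columns as doubles, as in the question's dput().
out <- data.frame(
  X1 = as.numeric(c(rep(1:3, each = 6), rep(4:6, each = 3))),
  X2 = as.numeric(c(rep(4:9, times = 3), rep(7:9, times = 3)))
)

dim(out)  # 27 rows, 2 columns
```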

      

Another approach:

res <- do.call(rbind, unlist(lapply(seq(ncol(dat) - 1), function(x) 
  lapply(seq(x + 1, ncol(dat)), function(y) 
    "names<-"(expand.grid(dat[c(x, y)]), c("X1", "X2")))),
  recursive = FALSE))

      

where dat is the name of your dataframe.

You can sort the result with this command:

res[order(res[[1]], res[[2]]), ]
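
The sort matters because expand.grid varies its first column fastest, so the unsorted rows do not follow the order of the desired output. A small sketch on two columns of the sample data (the names g and g_sorted are made up for illustration):

```r
# Sample data from the question
dat <- data.frame(A = 1:3, B = 4:6, C = 7:9)

# All combinations of columns A and B; the first column cycles fastest
g <- expand.grid(dat[c("A", "B")])
head(g, 4)  # rows (1,4), (2,4), (3,4), (1,5)

# Ordering by the first column, then the second, groups the pairs by A
g_sorted <- g[order(g[[1]], g[[2]]), ]
head(g_sorted, 4)  # rows (1,4), (1,5), (1,6), (2,4)
```

The sorted result keeps its original row names; if that is unwanted, rownames(g_sorted) <- NULL resets them.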

      
